
Will they blend? Experiments in Data & Tool Blending. Today: MS Word meets Web Crawling. Identifying the Secret Ingredient

In this blog series we’ll be experimenting with the most interesting blends of data and tools. Whether it’s mixing traditional sources with modern data lakes, open-source devops on the cloud with protected internal legacy tools, SQL with noSQL, web-wisdom-of-the-crowd with in-house handwritten notes, or IoT sensor data with idle chatting, we’re curious to find out: will they blend? Want to find out what happens when IBM Watson meets Google News, Hadoop Hive meets Excel, R meets Python, or MS Word meets MongoDB?

Follow us here and send us your ideas for the next data blending challenge you’d like to see at willtheyblend@knime.com.

Today: MS Word meets Web Crawling. Identifying the Secret Ingredient

Authors: Roland Burger and Heather Fyson

The Challenge

It’s Christmas again, and like every Christmas, I am ready to bake my famous Christmas cookies. They’re great and have been praised by many! My recipes are well-kept secrets as I am painfully aware of the competition. There are recipes galore on the web nowadays. I need to find out more about these internet-hosted recipes and their ingredients, particularly any ingredient I might have overlooked over the years.

My recipes are stored securely in an MS Word document on my computer, while a very well-regarded website for cookie recipes is http://www.handletheheat.com/peanut-butter-snickerdoodles/. I need to compare my own recipe with this one from the web and find out what makes them different. What is the secret ingredient for the ultimate Christmas cookie?

In practice, on one side I want to read and parse my recipe texts from the Word document, and on the other side I want to use web crawling to read and parse the web-hosted recipe. The question, as usual, is: will they blend?
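Outside of KNIME, the two access steps could look roughly like the Python sketch below. This is only an illustration of the idea, not the workflow from this post: it assumes the python-docx, requests, and beautifulsoup4 packages, and the local file name is a made-up placeholder.

from docx import Document
import requests
from bs4 import BeautifulSoup

# Side 1: parse the local MS Word document into plain text
doc = Document("my_recipes.docx")  # hypothetical file name
my_recipe_text = " ".join(p.text for p in doc.paragraphs)

# Side 2: crawl the web-hosted recipe and strip the HTML tags
url = "http://www.handletheheat.com/peanut-butter-snickerdoodles/"
page = requests.get(url, timeout=30)
web_recipe_text = BeautifulSoup(page.text, "html.parser").get_text(separator=" ")

# Blend: which words appear in the web recipe but not in mine?
my_words = set(my_recipe_text.lower().split())
web_words = set(web_recipe_text.lower().split())
print(sorted(web_words - my_words)[:50])  # candidate "secret ingredients"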

Topic. Christmas cookies.

Challenge. Identifying secret ingredients in cookie recipes from the web by comparing them to my own recipes stored locally in an MS Word document.

Access Mode. MS Word .docx file reading and parsing, and web crawler nodes.

Will they blend? Experiments in Data & Tool Blending. Today: A Cross-Platform Ensemble Model: R meets Python and KNIME. Embrace Freedom in the Data Science Lab

In this blog series we’ll be experimenting with the most interesting blends of data and tools. Whether it’s mixing traditional sources with modern data lakes, open-source devops on the cloud with protected internal legacy tools, SQL with noSQL, web-wisdom-of-the-crowd with in-house handwritten notes, or IoT sensor data with idle chatting, we’re curious to find out: will they blend? Want to find out what happens when IBM Watson meets Google News, Hadoop Hive meets Excel, R meets Python, or MS Word meets MongoDB?

Follow us here and send us your ideas for the next data blending challenge you’d like to see at willtheyblend@knime.com.

Today: A Cross-Platform Ensemble Model: R meets Python and KNIME. Embrace Freedom in the Data Science Lab

The Challenge

Today’s challenge consists of building a cross-platform ensemble model. The ensemble model must combine a Support Vector Machine (SVM), a logistic regression, and a decision tree. Let’s raise the bar even more and train these models on different analytics platforms: R, Python, and of course KNIME. (Note that we could, of course, create all those models in KNIME, but that would kill the rest of the story...)

A small group of three data scientists was given the task of predicting flight departure delays from Chicago O’Hare (ORD) airport, based on the airline data set. As soon as the data came in, all three data scientists built a model in record time. I mean, each one of them built a different model on a different platform! We ended up with a Python script building a logistic regression, an R script building an SVM, and a KNIME workflow training a decision tree. Which one should we choose?

We had two options here: select the best model and crown a champion, or embrace diversity and build an ensemble model. Since more is usually better than less, we opted for the ensemble model. Thus, we just needed to convince two of the three data scientists to switch analytics platforms.

Or maybe not.

Thanks to its open architecture, KNIME can easily integrate R and Python scripts. In this way, every data scientist can use his/her preferred analytics platform, while KNIME collects and fuses the results.

Today’s challenge has three main characters: a decision tree built on KNIME Analytics Platform, an SVM built in R, and a logistic regression built with Python. Will they blend?
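Before we answer, here is a minimal Python sketch of the fusion step only, assuming the three platform-specific models have already scored the same rows and their predictions have been collected into a single table. The column names and sample values are made up for illustration; in the post itself the fusion happens with KNIME nodes.

import pandas as pd

# Hypothetical predictions from the three models on the same four flights
predictions = pd.DataFrame({
    "pred_knime_tree":    ["delayed", "on_time", "delayed", "on_time"],
    "pred_r_svm":         ["delayed", "delayed", "delayed", "on_time"],
    "pred_python_logreg": ["on_time", "on_time", "delayed", "on_time"],
})

# Majority vote across the three models: the most frequent label per row wins
predictions["ensemble"] = predictions.mode(axis=1)[0]
print(predictions)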

Topic. Flight departure delays from Chicago O’Hare (ORD) Airport.

Challenge. Build a cross-platform ensemble model by blending an R SVM, a Python logistic regression, and a KNIME decision tree.

KNIME Extensions. Python and R Integrations.

The Node Guide. Finding Help in KNIME Analytics Platform and on the Web

Did you ever ask yourself, while using KNIME Analytics Platform, “What should I do next?” Or “How can I use this node?” Or “What on earth is this parameter for?!” Whether you are new to KNIME or already an expert, I’m sure you have asked yourself these questions at some point and might still be wondering about them.

There are already many ways to find the answers:

  • Read the node documentation provided in the KNIME workbench
  • Go to the KNIME Forum to ask for advice about how a node can be used
  • Check out the EXAMPLES Server for demos on particular use cases, with detailed annotations and organized categories

In addition to these three options, we have recently added new help content to our website and to KNIME Analytics Platform…

…Please welcome the Node Guide! Together with the Workflow Coach, they are your new best friends. The information provided here is intended for beginners and experienced users of KNIME alike.

The Node Guide is a searchable web reference for nodes, with example workflows that demonstrate how they are used. You might have used it already. In fact, Node Guide pages now pop up in search engines as answers to KNIME node questions; questions in the KNIME Forum are increasingly answered with links to Node Guide pages; and a new entry in the Learning section of the KNIME web site takes you directly to the Node Guide page.

Will they blend? Experiments in Data & Tool Blending. Today: Twitter meets PostgreSQL. More than idle chat?

In this blog series we’ll be experimenting with the most interesting blends of data and tools. Whether it’s mixing traditional sources with modern data lakes, open-source devops on the cloud with protected internal legacy tools, SQL with noSQL, web-wisdom-of-the-crowd with in-house handwritten notes, or IoT sensor data with idle chatting, we’re curious to find out: will they blend? Want to find out what happens when IBM Watson meets Google News, Hadoop Hive meets Excel, R meets Python, or MS Word meets MongoDB?

Follow us here and send us your ideas for the next data blending challenge you’d like to see at willtheyblend@knime.com.

Today: Twitter meets PostgreSQL. More than idle chat?

The Challenge

Today we will trespass into the world of idle chatting. Since Twitter is, as everybody knows, THE place for idle chatting, our blending goal for today is a mini-DWH application to archive tweets day by day. The tweet topic can be anything, but for this experiment we investigate what people say about #KNIME, using a word cloud built from last month’s tweets.

Now, if you connect to Twitter with a free developer account, you will only receive the most recent tweets about the chosen topic. If you are interested in all tweets from last month, for example, you need to regularly download and store them in an archive. That is, you need to build a data warehousing (DWH) application to archive past tweets.

As the archival tool for this experiment, we chose a PostgreSQL database. At the very least, the DWH application should download and store yesterday’s tweets. As a bonus, it could also create the word cloud from all tweets posted in the past month. That is, it should combine yesterday’s tweets from Twitter with last month’s tweets from the archive to build the desired word cloud.

To summarize: on one side, we collect yesterday’s tweets directly from Twitter; on the other side, we retrieve past tweets from a PostgreSQL database. Will they blend?
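For readers who prefer plain code to nodes, here is a rough Python sketch of the archiving side only, assuming the psycopg2 package and yesterday’s tweets already downloaded via the Twitter API. The connection parameters, table name, and sample tweet are placeholders; the actual experiment uses KNIME’s Twitter and database nodes.

import psycopg2

# Hypothetical: yesterday's tweets, already fetched as (id, posted_at, text)
yesterdays_tweets = [
    (1234567890, "2017-01-15 09:30:00", "Trying out #KNIME for text mining"),
]

conn = psycopg2.connect(host="localhost", dbname="twitter_dwh",
                        user="knime", password="secret")
with conn, conn.cursor() as cur:
    # Create the archive table once, then append yesterday's tweets every day
    cur.execute("""CREATE TABLE IF NOT EXISTS tweet_archive (
                       id BIGINT PRIMARY KEY,
                       posted_at TIMESTAMP,
                       text TEXT)""")
    for tweet in yesterdays_tweets:
        cur.execute("INSERT INTO tweet_archive VALUES (%s, %s, %s) "
                    "ON CONFLICT (id) DO NOTHING", tweet)

    # Retrieve last month's tweets back from the archive for the word cloud
    cur.execute("SELECT text FROM tweet_archive "
                "WHERE posted_at >= now() - interval '1 month'")
    texts = [row[0] for row in cur.fetchall()]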

Topic. What people say about #KNIME on Twitter.

Challenge. Blend tweets from Twitter with tweets from a PostgreSQL database and draw the corresponding word cloud.

Access Mode. Twitter access and Database access.

Will they blend? Experiments in Data & Tool Blending. Today: Hadoop Hive meets Excel. Your flight is boarding now

In this blog series we’ll be experimenting with the most interesting blends of data and tools. Whether it’s mixing traditional sources with modern data lakes, open-source devops on the cloud with protected internal legacy tools, SQL with noSQL, web-wisdom-of-the-crowd with in-house handwritten notes, or IoT sensor data with idle chatting, we’re curious to find out: will they blend? Want to find out what happens when IBM Watson meets Google News, Hadoop Hive meets Excel, R meets Python, or MS Word meets MongoDB?

Follow us here and send us your ideas for the next data blending challenge you’d like to see at willtheyblend@knime.com.

Today: Hadoop Hive meets Excel. Your flight is boarding now

The Challenge

Today’s challenge is weather-based - and something we’ve all experienced ourselves while traveling. How are flight departures at US airports impacted by changing weather patterns? What role do weather and temperature fluctuations play in delaying flights?

We’re sure the big data geeks at the big airlines have their own stash of secret predictive algorithms, but we can also try to figure this out ourselves. To do that, we first need to combine weather information with flight departure data.

On the one hand, we have a whole archive of US flights over the years, something on the order of millions of records, which we have saved on a big data platform, in this case Hadoop Hive. On the other, we have daily US weather information, downloadable from https://www.ncdc.noaa.gov/cdo-web/datasets/ in the form of Excel files. So, a Hadoop parallel platform on one side and traditional Excel spreadsheets on the other. Will they blend?
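As a rough point of reference, the same blend could be sketched outside of KNIME in a few lines of Python, assuming the PyHive and pandas packages. The host, table, column, and file names below are made-up placeholders; the post itself uses KNIME’s Hive connection and Excel reading nodes.

import pandas as pd
from pyhive import hive

# Side 1: pull an aggregate of the flight records from Hadoop Hive
hive_conn = hive.Connection(host="hadoop-cluster.example.com", port=10000,
                            username="knime")
flights = pd.read_sql(
    "SELECT flightdate, AVG(depdelay) AS avg_delay "
    "FROM flights WHERE origin = 'ORD' GROUP BY flightdate", hive_conn)

# Side 2: read the daily weather measurements from the NOAA Excel file
weather = pd.read_excel("noaa_daily_weather.xlsx")  # hypothetical file name

# Blend: join flights and weather on the date and look at correlations
blended = flights.merge(weather, left_on="flightdate", right_on="DATE")
print(blended[["avg_delay", "TMAX", "PRCP"]].corr())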

Topic. Exploring correlations between flight delays and weather variables.

Challenge. Blend data from Hadoop Hive and Excel files.

Access Mode. Connection to Hive with in-database processing and Excel file reading.

Will they blend? Experiments in Data & Tool Blending. Today: Open Street Maps (OSM) meets CSV Files and Google Geocoding API

In this blog series we’ll be experimenting with the most interesting blends of data and tools. Whether it’s mixing traditional sources with modern data lakes, open-source devops on the cloud with protected internal legacy tools, SQL with noSQL, web-wisdom-of-the-crowd with in-house handwritten notes, or IoT sensor data with idle chatting, we’re curious to find out: will they blend? Want to find out what happens when IBM Watson meets Google News, Hadoop Hive meets Excel, R meets Python, or MS Word meets MongoDB?

Follow us here and send us your ideas for the next data blending challenge you’d like to see at willtheyblend@knime.com.

Today: Open Street Maps (OSM) meets CSV Files and Google Geocoding API

The Challenge

Today’s challenge is a geographical one. Do you know which are the most populated cities in the world? Do you know where they are? China? The USA? And, by way of contrast, do you know which are the smallest cities in the world?

Today we want to show you on a map where you can find the largest and the smallest cities in the world by population. While there is general agreement among trustworthy sources on the web about which are the most populated cities, agreement becomes sparser when looking for the smallest cities in the world. There is general agreement, though, about which ones are the smallest capitals in the world.

We collected data for the world’s 125 largest cities in one CSV text file and data for the 10 smallest capitals of equally small and beautiful countries in another CSV text file. The data include city name, country, size in square kilometers, population, and population density. Today’s challenge is to locate these cities on a world map. Technically this means:

  • To blend the city data from the CSV files with the city geo-coordinates from the Google Geocoding API in KNIME Analytics Platform (the geocoding call is sketched in the code after this list)
  • Then to blend the ETL and machine learning from KNIME Analytics Platform with the geographical visualization of Open Street Maps.
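As a rough illustration of the first step - fetching geo-coordinates over REST - here is a short Python sketch using the requests package against the Google Geocoding API. The CSV file name, column names, and API key are placeholders; in the post itself this is done with KNIME’s file reading and REST nodes.

import pandas as pd
import requests

cities = pd.read_csv("largest_cities.csv")  # hypothetical file with city, country columns

GEOCODE_URL = "https://maps.googleapis.com/maps/api/geocode/json"
API_KEY = "YOUR_GOOGLE_API_KEY"  # placeholder

def geocode(city, country):
    # Ask the Google Geocoding API for the latitude/longitude of a city
    params = {"address": f"{city}, {country}", "key": API_KEY}
    result = requests.get(GEOCODE_URL, params=params, timeout=30).json()
    location = result["results"][0]["geometry"]["location"]
    return location["lat"], location["lng"]

cities[["lat", "lng"]] = cities.apply(
    lambda row: pd.Series(geocode(row["city"], row["country"])), axis=1)
print(cities.head())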

Topic. Geo-localization of cities on a world map.

Challenge. Blend city data from CSV files with city geo-coordinates from the Google Geocoding API and display them on an OSM world map.

Access Mode. CSV file and REST service for Google Geocoding API.

Integrated Tool. Open Street Maps (OSM) for data visualization.

Will they blend? Experiments in Data & Tool Blending. Today: IBM Watson meets Google API

In this blog series we’ll be experimenting with the most interesting blends of data and tools. Whether it’s mixing traditional sources with modern data lakes, open-source devops on the cloud with protected internal legacy tools, SQL with noSQL, web-wisdom-of-the-crowd with in-house handwritten notes, or IoT sensor data with idle chatting, we’re curious to find out: will they blend? Want to find out what happens when IBM Watson meets Google News, Hadoop Hive meets Excel, R meets Python, or MS Word meets MongoDB?

Follow us here and send us your ideas for the next data blending challenge you’d like to see at willtheyblend@knime.com.

Today: IBM Watson meets Google API

The Challenge

It’s said that you should always get your news from several different sources, then compare and contrast to form your own independent opinion. At the time of this writing, we are only days away from the US election, and all the news we can find is about the election race between Hillary Clinton and Donald Trump. Our question then is: what has Obama been doing?

So what about blending IBM Watson’s News Service with Google News to find out?

My free subscription to IBM Watson runs out in a few days, so I’d better hurry up and experiment with how I can use it inside KNIME Analytics Platform. Google News, on the other hand, is a free but limited service.

Let’s see what happens when we blend IBM Watson News and Google News within KNIME Analytics Platform. Shall we?
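Before the workflow, here is a small taste of the kind of REST access involved, sketched in Python. It only shows the Google News side and assumes Google News exposes an RSS 2.0 search feed at the URL below; the IBM Watson news service would be queried with the same GET-plus-API-key pattern against its own endpoint, which is not reproduced here. In the post itself both services are accessed as REST services from KNIME.

import requests
import xml.etree.ElementTree as ET

# Query the Google News RSS search feed for headlines mentioning Obama
rss = requests.get("https://news.google.com/rss/search",
                   params={"q": "Barack Obama"}, timeout=30)

# RSS 2.0: headlines live in the <title> element of each <item>
google_headlines = [item.findtext("title")
                    for item in ET.fromstring(rss.content).iter("item")]
print(google_headlines[:10])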

Topic. Barack Obama in the news.

Challenge. Extract and combine news headlines from Google News and IBM Watson News.

Access Mode. REST service for both Google API and IBM Watson.

Just Add Imagination…

A short description of data analytics project formalization
through some of the whitepapers developed over time here at KNIME.com AG

It is not hard nowadays to find conference talks and blog posts on the web claiming that data analytics, or data science as it is now called, can do wonders for your company. Sure! However, identifying the relevant problems and formalizing them in terms of available data and desired output remain the biggest obstacles to a realistic implementation of any data-driven project.

Indeed, the most important part of a data analytics project always comes at the beginning, when the problem to be solved is selected and formally defined. There are usually many problems to choose from. How do we select the most remunerative and the least work-intensive one?

I encounter this blockage often when I’m out giving presentations and discussing data analytics strategies. So, I thought it might be useful to describe the path followed in some past projects. Of course, the ones presented here are just a subset of the many solutions built and refined with KNIME Analytics Platform.

You can find them

Integrating One More Data Source: The Semantic Web

The Semantic Web

According to the W3C Linked Data page, the Semantic Web refers to a technology stack to support the “Web of data”. Semantic Web technologies enable people to create data stores on the Web, build vocabularies, and write rules for handling data. Linked data are empowered by technologies such as RDF, SPARQL, OWL, and SKOS.

  • RDF. The Resource Description Framework is a standard data model for representing the metadata of resources on the Web; it can represent all resources, even those that cannot be directly retrieved. RDF especially helps to process, mix, expose, and share such metadata. In terms of the relational model, an RDF statement specifies a relationship between two resources and resembles a triple relation with subject, predicate, and object.
  • OWL. The Web Ontology Language is based on the basic elements of RDF, but uses a wider vocabulary to describe properties and classes.
  • SKOS. Simple Knowledge Organization System is also based on RDF and specifically designed to express hierarchical information. If needed, it is also extendable into OWL.
  • SPARQL. The SPARQL Protocol and RDF Query Language is a query language for RDF data, used to retrieve and manipulate public and private metadata stored in RDF format.

A commonly used instance of the Semantic Web is the DBpedia project, which was created to extract structured content from Wikipedia.
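To make the terminology above a little more concrete, here is a small SPARQL query against the public DBpedia endpoint, sketched in Python with the SPARQLWrapper package. This is just an illustration of the kind of query involved, not one of the new KNIME nodes.

from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("https://dbpedia.org/sparql")
sparql.setQuery("""
    PREFIX dbo: <http://dbpedia.org/ontology/>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    SELECT ?city ?population WHERE {
        ?city a dbo:City ;
              rdfs:label ?name ;
              dbo:populationTotal ?population .
        FILTER (lang(?name) = "en")
    }
    ORDER BY DESC(?population)
    LIMIT 10
""")
sparql.setReturnFormat(JSON)
results = sparql.query().convert()

# Each binding is one subject-predicate-object match of the query pattern
for row in results["results"]["bindings"]:
    print(row["city"]["value"], row["population"]["value"])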

Our latest release, KNIME Analytics Platform 3.2, includes a great feature: semantic web integration! A full node category is dedicated to querying and manipulating semantic web resources. The new semantic web nodes treat the web of data exactly like a database, with connector nodes, query nodes, and manipulation nodes. Additional nodes are provided to read and write files in various formats.

10 years of KNIME, 10 years of innovation

Here we are, 10 years later! It has been an incredible journey, challenging and rewarding at the same time. Starting from an embryonic idea in 2006 - to make data analytics available and affordable to every data scientist in the world - we embarked on this adventure with undefined expectations about the future. As you can often judge a book by its opening lines, those initial steps gave some early indications of what the KNIME platform and the KNIME company would bring.

KNIME philosophy

KNIME’s philosophy has always been to build an open platform for data science people. The first interpretation of this idea was to create an affordable and easy-to-use tool. Of course, in order to be of any use to a data scientist, such a tool would also have to cover as much as possible of machine learning, statistics, ETL, data blending, and data visualization. Along the same lines, it would have to be able to import data from as many data sources as possible and - why not? - from as many different data science tools as possible. Later on, our open platform philosophy progressed into building a team collaboration tool. Times have changed, and a single isolated data scientist is not enough anymore; teams of data scientists and data analysts working together are needed. And of course, we couldn’t neglect speed and agile prototyping. So many challenges, all to be met in a quick and elegant way!

This constant attention to the needs of data science teams around the world, combined with the courage to embrace openness and integration, has made KNIME one of the most innovative companies in the field of data science over the last 10 years. Today, we are proud to confirm that the choice of building an open platform has led to a completely new way of managing data analytics - a way that is friendly, approachable, easy, extensive, powerful, inclusive, affordable, collaborative, and graphical. A way that many have tried to copy, but never really equaled. This might also be because KNIME is primarily motivated by passion and science rather than money and revenues.

Of course, this KNIME philosophy was not shaped in one day. It has taken 10 years to develop and formalize it to the current point. But all the ingredients were already present right at the beginning. This blog post takes you on this 10-year journey.