KNIME news, usage, and development

The Node Guide. Finding Help in KNIME Analytics Platform and on the Web

Did you ever ask yourself, while using KNIME Analytics Platform, “What should I do next?” Or “How can I use this node?” Or “What on earth is this parameter for?!” Whether you are new to KNIME or already an expert, you have probably asked yourself these questions at some point and may still be wondering about them.

There are already many ways to find the answers:

  • Read the node documentation provided in the KNIME workbench
  • Go to the KNIME Forum to ask for advice about how a node can be used
  • Check out the EXAMPLES Server for demos of particular use cases, with detailed annotations and organized categories

In addition to these three options, we have recently added new help content to our website and to KNIME Analytics Platform…

…Please welcome the Node Guide! Together with the Workflow Coach, it is your new best friend. The information provided here is intended for beginners and experienced users of KNIME alike.

The Node Guide is a searchable web reference for nodes, with example workflows that demonstrate how they are used. You might have used it already. In fact, Node Guide pages now pop up in search engines as answers to KNIME node questions; questions in the KNIME Forum are increasingly answered with links to Node Guide pages; and a new entry in the Learning section of the KNIME website takes you directly to the Node Guide page.

Will they blend? Experiments in Data & Tool Blending. Today: Twitter meets PostgreSQL. More than idle chat?

In this blog series we’ll be experimenting with the most interesting blends of data and tools. Whether it’s mixing traditional sources with modern data lakes, open-source devops on the cloud with protected internal legacy tools, SQL with noSQL, web-wisdom-of-the-crowd with in-house handwritten notes, or IoT sensor data with idle chatting, we’re curious to find out: will they blend? Want to find out what happens when IBM Watson meets Google News, Hadoop Hive meets Excel, R meets Python, or MS Word meets MongoDB?

Follow us here and send us your ideas for the next data blending challenge you’d like to see at willtheyblend@knime.com.

Today: Twitter meets PostgreSQL. More than idle chat?

The Challenge

Today we will trespass into the world of idle chatting. Since Twitter is, as everybody knows, THE place for idle chatting, our blending goal for today is a mini-DWH application that archives tweets day by day. The tweet topic can be anything, but for this experiment we will investigate what people say about #KNIME by building a word cloud from last month’s tweets.

Now, if you connect to Twitter with a free developer account, you will only receive the most recent tweets on the chosen topic. If you are interested in all tweets from last month, for example, you need to regularly download and store them in an archive. That is, you need to build a data warehousing (DWH) application to archive past tweets.

As the archival tool for this experiment, we chose a PostgreSQL database. At the very least, the DWH application should download and store yesterday’s tweets. As a bonus, it could also create the word cloud from all tweets posted in the past month. That is, it should combine yesterday’s tweets from Twitter with last month’s tweets from the archive to build the desired word cloud.

To summarize: on one side we collect yesterday’s tweets directly from Twitter; on the other, we retrieve past tweets from a PostgreSQL database. Will they blend?
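The blending step itself can be sketched in a few lines of plain Python — no KNIME nodes and no real database, just the idea: de-duplicate yesterday’s tweets against the archive, then count word frequencies as input for the word cloud. All tweet texts and stop words below are made up for illustration.

```python
from collections import Counter

def blend_and_count(yesterday_tweets, archived_tweets, stopwords=()):
    """Combine yesterday's tweets with the archived ones (dropping
    duplicates already in the archive) and count word frequencies
    as input for a word cloud."""
    all_tweets = set(archived_tweets) | set(yesterday_tweets)
    counts = Counter()
    for tweet in all_tweets:
        for word in tweet.lower().split():
            word = word.strip(".,!?")
            if word and word not in stopwords:
                counts[word] += 1
    return counts

# Made-up tweets standing in for the PostgreSQL archive and the Twitter feed
archive = ["Loving the new #knime release!", "Data blending with #knime is fun"]
fresh = ["Data blending with #knime is fun", "#knime meets PostgreSQL today"]
top = blend_and_count(fresh, archive, stopwords={"the", "is", "with"})
print(top.most_common(1))
```

In the actual workflow, the archive half of this blend would of course come from a Database Reader node and the fresh half from the Twitter nodes.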

Topic. What people say about #KNIME on Twitter.

Challenge. Blend together tweets from Twitter and tweets from a PostgreSQL database and draw the corresponding word cloud.

Access Mode. Twitter access and Database access.

Will they blend? Experiments in Data & Tool Blending. Today: Hadoop Hive meets Excel. Your flight is boarding now

Today: Hadoop Hive meets Excel. Your flight is boarding now

The Challenge

Today’s challenge is weather-based - and something we’ve all experienced ourselves while traveling. How are flight departures at US airports impacted by changing weather patterns? What role do weather and temperature fluctuations play in delaying flights?

We’re sure the big data geeks at the big airlines have their own stash of secret predictive algorithms, but we can also try to figure this out ourselves. To do that, we first need to combine weather information with flight departure data.

On the one hand, we have a whole archive of US flights over the years - on the order of millions of records - which we have saved on a big data platform such as Hadoop Hive. On the other, we have daily US weather information downloadable from https://www.ncdc.noaa.gov/cdo-web/datasets/ in the form of Excel files. So, a Hadoop parallel platform on one side and traditional Excel spreadsheets on the other. Will they blend?
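As a rough sketch of what the blending step boils down to (not the actual KNIME workflow, and with invented field names and records), joining the Hive flight records with the Excel weather rows is essentially a keyed join on airport and date:

```python
def blend_flights_weather(flights, weather):
    """Join flight-departure records (as they might come from Hive) with
    daily weather readings (as read from an Excel sheet) on airport and date."""
    wx = {(w["airport"], w["date"]): w for w in weather}  # O(1) lookup index
    blended = []
    for f in flights:
        w = wx.get((f["airport"], f["date"]))
        if w is not None:  # inner join: skip flights without a weather record
            blended.append({**f, "temp": w["temp"], "precip": w["precip"]})
    return blended

# Invented records - field names are illustrative only
flights = [
    {"airport": "ORD", "date": "2016-01-15", "delay_min": 42},
    {"airport": "SFO", "date": "2016-01-15", "delay_min": 3},
]
weather = [{"airport": "ORD", "date": "2016-01-15", "temp": -7.0, "precip": 1.2}]
rows = blend_flights_weather(flights, weather)
print(rows)
```

In KNIME, the same logic is a Joiner node fed by the Hive connector on one port and the Excel Reader on the other; with millions of records, the join can also be pushed down into Hive itself.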

Topic. Exploring correlations between flight delays and weather variables.

Challenge. Blend data from Hadoop Hive and Excel files.

Access Mode. Connection to Hive with in-database processing and Excel file reading.

Will they blend? Experiments in Data & Tool Blending. Today: Open Street Maps (OSM) meets CSV Files and Google Geocoding API

Today: Open Street Maps (OSM) meets CSV Files and Google Geocoding API

The Challenge

Today’s challenge is a geographical one. Do you know which are the most populated cities in the world? Do you know where they are? China? The USA? And, by way of contrast, do you know which are the smallest cities in the world?

Today we want to show you, on a map, where you can find the world’s largest and smallest cities by population. While there is general agreement among trustworthy sources on the web about which are the most populated cities, agreement becomes sparser when looking for the smallest cities in the world. There is, however, general agreement about which are the smallest capitals in the world.

We collected data for the world’s 125 largest cities in one CSV text file and data for the 10 smallest capitals of equally small and beautiful countries in another. The data includes city name, country, size in square kilometers, population, and population density. Today’s challenge is to locate these cities on a world map. Technically, this means:

  • Blending the city data from the CSV files with the city geo-coordinates from the Google Geocoding API within KNIME Analytics Platform
  • Then blending the ETL and machine learning capabilities of KNIME Analytics Platform with the geographical visualization of Open Street Maps
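As an illustration of the REST access mode, a request to the Google Geocoding API is just a URL with an `address` and a `key` parameter; the JSON response carries the coordinates under `results[0].geometry.location`. This minimal Python sketch only builds the request URL - the API key is a placeholder and no call is actually sent:

```python
from urllib.parse import urlencode

GEOCODE_ENDPOINT = "https://maps.googleapis.com/maps/api/geocode/json"

def geocode_url(city, country, api_key):
    """Build the GET request URL for the Google Geocoding API; the JSON
    response's results[0].geometry.location holds the lat/lng pair."""
    query = urlencode({"address": f"{city}, {country}", "key": api_key})
    return f"{GEOCODE_ENDPOINT}?{query}"

# "YOUR_API_KEY" is a placeholder - no request is actually sent here
url = geocode_url("Tokyo", "Japan", "YOUR_API_KEY")
print(url)
```

In the workflow, a GET Request node issues exactly this kind of URL for each city row and a JSON parser extracts the latitude/longitude pair.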

Topic. Geo-localization of cities on a world map.

Challenge. Blend city data from CSV files with city geo-coordinates from the Google Geocoding API and display them on an OSM world map.

Access Mode. CSV file and REST service for Google Geocoding API.

Integrated Tool. Open Street Maps (OSM) for data visualization.

Will they blend? Experiments in Data & Tool Blending. Today: IBM Watson meets Google API

Today: IBM Watson meets Google API

The Challenge

It’s said that you should always get your news from several different sources, then compare and contrast them to form your own independent opinion. At the time of this writing, we are only days away from the US election, and all the news we can find consists of articles about the election race between Hillary Clinton and Donald Trump. Our question then is: what has Obama been doing?

So what about blending IBM Watson’s News Service with Google News to find out?

My free subscription to IBM Watson expires in a few days, so I’d better hurry up and experiment with how I can use it inside KNIME Analytics Platform. Google News, on the other hand, is a free but limited service.

Let’s see what happens when we blend IBM Watson News and Google News within KNIME Analytics Platform. Shall we?

Topic. Barack Obama in the news.

Challenge. Extract and combine news headlines from Google News and IBM Watson News.

Access Mode. REST service for both Google API and IBM Watson.

Just Add Imagination…

A short description of data analytics project formalization
through some of the whitepapers developed over time here at KNIME.com AG

It is not hard nowadays to find conference talks and blog posts on the web claiming that data analytics - or data science, as it is now called - can do wonders for your company. Sure! However, identifying the relevant problems and formalizing them in terms of available data vs. desired output remain the biggest obstacles to a realistic implementation of any data-driven project.

Indeed, the most important part of a data analytics project always comes at the beginning, when the problem to be solved is selected and formally defined. There are usually many problems. How do we select the most remunerative and least work-intensive one?

I encounter this blockage often when I’m out giving presentations and discussing data analytics strategies. So, I thought it might be useful to describe the path followed in some past projects. Of course, the ones presented here are just a subset of the many solutions built and refined with KNIME Analytics Platform.

You can find them

Integrating One More Data Source: The Semantic Web

The Semantic Web

According to the W3C Linked Data page, the Semantic Web refers to a technology stack to support the “Web of data”. Semantic Web technologies enable people to create data stores on the Web, build vocabularies, and write rules for handling data. Linked data are empowered by technologies such as RDF, SPARQL, OWL, and SKOS.

  • RDF. Resource Description Framework is a standard data model for representing the metadata of resources on the Web; it represents all resources, even those that cannot be directly retrieved. RDF especially helps to process, mix, expose, and share such metadata. In terms of the relational model, an RDF statement specifies a relationship between two resources; it is similar to a triple with subject, predicate, and object.
  • OWL. Ontology Web Language is based on the basic elements of RDF, but uses a wider vocabulary to describe properties and classes.
  • SKOS. Simple Knowledge Organization System is also based on RDF and specifically designed to express hierarchical information. If needed, it is also extendable into OWL.
  • SPARQL. SPARQL Protocol and RDF Query Language is an RDF-based query language used to retrieve and manipulate public and private metadata stored in RDF format.

A commonly used instance of the Semantic Web is the DBpedia project, which was created to extract structured content from Wikipedia.
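To make the triple model concrete, here is a tiny Python sketch that stores RDF-like statements as (subject, predicate, object) tuples and answers a SPARQL-style pattern query. The resource names are illustrative placeholders, not actual DBpedia identifiers:

```python
# Each RDF statement is a (subject, predicate, object) triple.
# The identifiers below are illustrative, not actual DBpedia resources.
triples = [
    ("ex:KNIME", "rdf:type", "ex:Software"),
    ("ex:KNIME", "ex:writtenIn", "ex:Java"),
    ("ex:Java", "rdf:type", "ex:ProgrammingLanguage"),
]

def match(store, s=None, p=None, o=None):
    """A SPARQL-like pattern match: None plays the role of a query variable."""
    return [t for t in store
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

# Roughly: SELECT ?s WHERE { ?s rdf:type ex:Software }
print(match(triples, p="rdf:type", o="ex:Software"))
```

A real SPARQL endpoint does far more (graphs, joins, inference), but the pattern-over-triples idea is the same one the new KNIME query nodes expose.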

Our latest release, KNIME Analytics Platform 3.2, includes a great feature: Semantic Web integration! A full node category is dedicated to querying and manipulating Semantic Web resources. The new Semantic Web nodes treat the web of data exactly like a database, with connector nodes, query nodes, and manipulation nodes. Additional nodes are provided to read and write files in various formats.

10 years of KNIME, 10 years of innovation

Here we are, 10 years later! It has been an incredible journey, challenging and rewarding at the same time. Starting from an embryonic idea in 2006 - to make data analytics available and affordable to every data scientist in the world - we embarked on this adventure with undefined expectations about the future. As you can often judge a book by its opening lines, those initial steps gave some early indications of what the KNIME platform and the KNIME company would bring.

KNIME philosophy

KNIME’s philosophy has always been to build an open platform for data science people. The first interpretation of this idea was to create an affordable and easy-to-use tool. Of course, in order to be of any use to a data scientist, such a tool would also have to cover as much as possible of machine learning, statistics, ETL, data blending, and data visualization. Along the same lines, it had to be able to import data from as many data sources as possible and - why not? - from as many different data science tools as possible. Later on, our open platform philosophy progressed into a team collaboration tool. Times have changed, and a single isolated data scientist is not enough anymore; teams of data scientists and data analysts working together are needed. And of course, we couldn’t neglect speed and agile prototyping. So many challenges, all to be satisfied in a quick and elegant way!

This constant attention to the needs of data science teams around the world combined with the courage to embrace openness and integration has made KNIME one of the most innovative companies in the field of data science over the last 10 years. Today, we are proud to confirm that the choice of building an open platform has led to a completely new way of managing data analytics, a way that is friendly, approachable, easy, extensive, powerful, inclusive, affordable, collaborative, and graphical. A way that many have tried to copy, but never really equaled. This might also be because KNIME is primarily motivated by passion and science rather than money and revenues.

Of course, this KNIME philosophy was not shaped in one day. It has taken 10 years to develop and formalize it to the current point. But all the ingredients were already present right at the beginning. This blog post takes you on this 10-year journey.

Customer Segmentation comfortably from a Web Browser

Definition of Customer Segments

Customer segmentation has undoubtedly been one of the most implemented applications in data analytics since the birth of customer intelligence and CRM data.

The concept is simple: group your customers together based on some criteria, such as revenue creation, loyalty, demographics, buying behavior, or any combination of these.

The group (or segment) can be defined in many ways, depending on the data scientist’s degree of expertise and domain knowledge.

  1. Grouping by rules. Somebody in the company already knows how the system works and how the customers should be grouped together with respect to a given task, e.g. a campaign. A Rule Engine node would suffice to implement this set of experience-based rules. This approach is highly interpretable, but not very portable to new analyses: in the presence of a new goal, new knowledge, or new data, the whole rule system needs to be redesigned.
  2. Grouping as binning. Sometimes the goal is clear and not negotiable. One of the many features describing our customers is selected as the representative one, be it revenue, loyalty, demographics, or anything else. In this case, segmenting the customers into groups reduces to a pure binning operation: customer segments are built along one or more attributes by means of bins. This task can be implemented easily using one of the many binner nodes available in KNIME Analytics Platform.
  3. Grouping with zero knowledge. Frequently, the data scientist does not know enough about the business at hand to build his or her own customer segmentation rules. In this case, if no business analyst is around to help, he or she should resort to a plain blind clustering procedure. The follow-up work of interpreting the clusters belongs to a business analyst, who is (or should be) the domain expert.
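To make options 1 and 2 concrete, here is a small Python sketch of rule-based grouping and single-attribute binning. The attributes, thresholds, and segment names are invented for illustration; in KNIME the same logic would live in a Rule Engine node and a binner node, respectively:

```python
def segment_by_rules(customer):
    """Option 1: experience-based rules, as a Rule Engine node would encode them."""
    if customer["revenue"] > 10_000:
        return "premium"
    if customer["orders_per_year"] >= 12:
        return "loyal"
    return "occasional"

def segment_by_binning(revenue, edges=(1_000, 10_000)):
    """Option 2: bin customers along a single attribute (here: revenue)."""
    for i, edge in enumerate(edges):
        if revenue < edge:
            return f"bin_{i}"
    return f"bin_{len(edges)}"

print(segment_by_rules({"revenue": 15_000, "orders_per_year": 2}))
print(segment_by_binning(500))
```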

With the set goal of making this workflow suitable for a number of different use cases, we chose the third option.  

There are many clustering procedures, and KNIME Analytics Platform makes them available in the Node Repository panel, in the category Analytics/Mining/Clustering: k-Means, nearest neighbors, DBSCAN, hierarchical clustering, SOTA, and more. We went for the most commonly used: the k-Means algorithm.
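The k-Means idea itself fits in a few lines: alternate between assigning each point to its nearest centroid and moving each centroid to the mean of its cluster. This one-dimensional Python sketch, with made-up revenue figures, is only an illustration of the algorithm the KNIME k-Means node implements:

```python
import random

def kmeans(points, k, iterations=20, seed=0):
    """A minimal 1-D k-means: repeatedly assign each point to its nearest
    centroid, then move every centroid to the mean of its cluster."""
    random.seed(seed)
    centroids = random.sample(points, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return sorted(centroids)

# Made-up customer revenues with two clearly separated groups
revenues = [100, 120, 90, 110, 5000, 5200, 4900]
centers = kmeans(revenues, k=2)
print(centers)
```

Real customer data is multi-dimensional and should be normalized first, which is exactly why a Normalizer node usually precedes the k-Means node in the workflow.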

Not all the REST, but a bit more of it in KNIME Server 4.3

The latest version of KNIME Server, 4.3, brings some additions to its REST interface. In this article I will present some of them and show how they can be used by client programs. Before we start, I should mention that the Mason specification we use as the response format has changed slightly, and we have adapted KNIME Server accordingly. You may want to have a look at the current version in case you have been using the Mason metadata.

Up and down

The most important addition is the possibility to upload and download items from the workflow repository. The usage is straightforward once you know the addresses. If you read our previous articles on the REST interface ([1], [2]), you may recall that the workflow repository can be browsed. The entry point for the repository is "/rest/v4/repository/". The metadata of items in the repository can be queried by simply appending the item's path to the base address, e.g. "/rest/v4/repository/workflow", and issuing a GET request. This will return the name, owner, type, and some other properties. Note that this also means that the same address cannot be used to access the actual contents of the item. Instead, you have to append ":data" to the item's URL and issue a GET request: GET http://<server>/knime/rest/v4/repository/workflow:data. You will also find this address in the @controls block of the Mason structure for the particular item (but only if you have permission to download the item).

The result will be a ZIP archive that contains the workflow, data file, or even the whole workflow group. The URL takes an optional query parameter "compressionLevel" with which you can control the compression on the server side (a value between 0 and 9; the default is 1). As any browser can issue a GET request when you paste a URL into the address field, you can easily try this out yourself.
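Building the download address amounts to simple string handling. This Python sketch - with 'server.example.com' standing in for your KNIME Server host - assembles the ":data" URL with the optional compressionLevel parameter:

```python
from urllib.parse import quote, urlencode

def download_url(base, item_path, compression_level=1):
    """Build the GET address for an item's contents: append ':data' to the
    item's repository path and add the optional compressionLevel parameter."""
    path = "/".join(quote(p) for p in item_path.strip("/").split("/"))
    query = urlencode({"compressionLevel": compression_level})
    return f"{base}/rest/v4/repository/{path}:data?{query}"

# 'server.example.com' stands in for your KNIME Server host
url = download_url("http://server.example.com/knime", "workflow")
print(url)
```

Issuing a GET against the resulting URL (with your credentials) then streams back the ZIP archive described above.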

Upload works in much the same way, except that the HTTP method used is PUT. The expected request content is also a ZIP archive containing the item to be uploaded. Please note that the request's content type must be "application/zip", otherwise the server will reject it. The ZIP archive is expected to contain exactly one entry at the root level: either the data file or the folder of the workflow or workflow group.
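A minimal Python sketch of the upload side - with a placeholder host and an invented one-file workflow - packages the single root-level entry into a ZIP and wraps it in a PUT request with the required content type. The request is built but not sent:

```python
import io
import zipfile
from urllib.request import Request

def build_upload_request(base, item_path, root_name, files):
    """Package an item as a ZIP with exactly one root-level entry and wrap
    it in a PUT request with the required 'application/zip' content type."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        for rel_path, content in files.items():
            zf.writestr(f"{root_name}/{rel_path}", content)
    url = f"{base}/rest/v4/repository/{item_path}:data"
    return Request(url, data=buf.getvalue(),
                   headers={"Content-Type": "application/zip"}, method="PUT")

# Host and workflow content are placeholders; the request is not sent here.
req = build_upload_request("http://server.example.com/knime", "myflow",
                           "myflow", {"workflow.knime": b"<xml/>"})
print(req.get_method(), req.full_url)
```

Passing the request to urllib.request.urlopen (with authentication) would perform the actual upload.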