KNIME news, usage, and development

Will They Blend? Experiments in Data & Tool Blending. Kindle epub meets image JPEG: Will KNIME make peace between the Capulets and the Montagues?

In this blog series we’ll be experimenting with the most interesting blends of data and tools. Whether it’s mixing traditional sources with modern data lakes, open-source devops on the cloud with protected internal legacy tools, SQL with noSQL, web-wisdom-of-the-crowd with in-house handwritten notes, or IoT sensor data with idle chatting, we’re curious to find out: will they blend? Want to find out what happens when IBM Watson meets Google News, Hadoop Hive meets Excel, R meets Python, or MS Word meets MongoDB?

Follow us here and send us your ideas for the next data blending challenge you’d like to see at willtheyblend@knime.com.

Today: Kindle epub meets image JPEG: Will KNIME make peace between the Capulets and the Montagues?

Authors: Heather Fyson and Kilian Thiel

The Challenge

“A plague o’ both your houses! They have made worms’ meat of me!” said Mercutio in Shakespeare’s “Romeo and Juliet”, a tragedy that stems from the characters’ inability to communicate effectively. Matters are made worse by the fact that Romeo and Juliet each come from the feuding “two households”: Romeo a Montague and Juliet a Capulet.

For this blog article, we decided to take a look at the interaction between the characters in the play by analyzing the script – an epub file – to see just who talks to whom. Are the Montagues and Capulets really divided families? Do they really not communicate? To make the results easier to read, we decided to visualize the network as a graph, with each node representing a character in the play and displaying an image of that character.

The “Romeo and Juliet” e-book can be downloaded for free in a number of formats from the Project Gutenberg website. For this experiment, we downloaded the epub file. epub is an e-book file format supported by many e-reading devices, such as the Amazon Kindle (for more information about the epub format, check https://en.wikipedia.org/wiki/EPUB).
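Under the hood, an epub file is just a ZIP container of XHTML documents, so even a few lines of standard-library Python are enough to peek at the script’s text. This is only a minimal sketch to illustrate the format; the file name is a placeholder, and the experiment itself uses an epub parser node in KNIME rather than hand-written code:

    import zipfile
    from html.parser import HTMLParser

    class TextExtractor(HTMLParser):
        """Collects the plain text of one XHTML chapter."""
        def __init__(self):
            super().__init__()
            self.chunks = []
        def handle_data(self, data):
            self.chunks.append(data)

    # 'romeo-and-juliet.epub' is a placeholder name for the Project Gutenberg download.
    with zipfile.ZipFile("romeo-and-juliet.epub") as book:
        for name in book.namelist():
            if name.endswith((".xhtml", ".html", ".htm")):
                parser = TextExtractor()
                parser.feed(book.read(name).decode("utf-8", errors="ignore"))
                print(name, " ".join(parser.chunks)[:80])  # first characters of each chapter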

The images for the characters of the Romeo and Juliet play have been kindly made available by Stadttheater Konstanz in JPEG format from a live show. JPEG is a commonly used image format (for more information about the JPEG format, check https://en.wikipedia.org/wiki/JPEG).

Unlike the Montague and the Capulet families – will epub and JPEG files blend?

Topic. Analyzing the graph structure of the dialogs in Shakespeare’s tragedy “Romeo and Juliet”.

Challenge. Blending epub and JPEG files and combining text mining and network visualization.

Access Mode. epub parser and JPEG reader.

Creating RESTful services with KNIME, an update

Creating a service

We’ve had a couple of posts in the past about creating RESTful services with the KNIME Analytics Platform and using the REST API provided by the KNIME Server. But since we keep adding functionality and making things easier, it’s worthwhile to occasionally come back and revisit the topic. This post will demonstrate a couple of changes since Jon’s last update.

We’ll start with this simple workflow, which finds the top N rows of a database by sorting on a particular column and then taking the first N rows from the result:

We want to make this workflow available as a web service where the caller can select the column to use for sorting and the value of N. It’s easy to change the sort column and N value in the KNIME Analytics Platform: we just configure those nodes and pick the values we want.



This option isn’t available for web services, so we need to take another approach.
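To give a rough idea of where this is heading, a client could eventually call the deployed workflow over REST along these lines. The endpoint path, parameter names, and credentials below are placeholders for illustration, not the actual API of the workflow shown above:

    import requests

    # Endpoint path, parameter names, and credentials are placeholders - the exact URL
    # and input names depend on your KNIME Server and on how the workflow is configured.
    url = "https://knime-server.example.com/knime/rest/v4/repository/TopN:execution"
    payload = {"sort-column": "Sales", "n": 10}

    response = requests.post(url, json=payload, auth=("user", "password"))
    response.raise_for_status()
    print(response.json())   # the workflow's result, returned to the caller as JSON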

Will They Blend? Experiments in Data & Tool Blending. Today: SAS, SPSS, and MATLAB meet Amazon S3: setting the past free

In this blog series we’ll be experimenting with the most interesting blends of data and tools. Whether it’s mixing traditional sources with modern data lakes, open-source devops on the cloud with protected internal legacy tools, SQL with noSQL, web-wisdom-of-the-crowd with in-house handwritten notes, or IoT sensor data with idle chatting, we’re curious to find out: will they blend? Want to find out what happens when IBM Watson meets Google News, Hadoop Hive meets Excel, R meets Python, or MS Word meets MongoDB?

Follow us here and send us your ideas for the next data blending challenge you’d like to see at willtheyblend@knime.com.

Today: SAS, SPSS, and MATLAB meet Amazon S3: setting the past free

The Challenge

I am an old guy. And old guys grew up with proprietary data formats for doing cool data science things. That means I have literally hundreds of SAS, SPSS, and MATLAB files on my hard disk and I would still love to use that data - but I no longer have the cash for the yearly licenses of those venerable old packages.

I know I can read all those files for free with KNIME. But what I REALLY want to do is move them into an open format and blend them somewhere modern like Amazon S3 or Microsoft Azure Blob Storage. But when I check out the various forums, like the SAS one, I find only horrendously complicated methods that - oh by the way - still require an expensive license of the appropriate legacy tool. So it’s KNIME to the rescue: to make it easy, this example pulls files of all three legacy types and lets you either move them as they are or first convert them to an open format before moving them to - in this example - Amazon S3.
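To underline how little stands between those legacy files and an open format, here is a minimal Python sketch of the same idea. File and bucket names are placeholders; the example workflow described here does all of this with KNIME reader nodes and an Amazon S3 connection instead:

    import pandas as pd
    import boto3
    from scipy.io import loadmat

    # File and bucket names are placeholders; pandas and SciPy do the reading,
    # so no SAS, SPSS, or MATLAB license is required.
    sas_df  = pd.read_sas("old_study.sas7bdat")   # SAS
    spss_df = pd.read_spss("old_survey.sav")      # SPSS (uses the pyreadstat package)
    mat     = loadmat("old_signals.mat")          # MATLAB .mat file -> dict of arrays

    # Convert one of them to an open format and push it to S3.
    sas_df.to_csv("old_study.csv", index=False)
    boto3.client("s3").upload_file("old_study.csv", "my-data-lake-bucket", "legacy/old_study.csv")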

It’s easy, it’s open, it’s beautiful… and it’s free.

Topic. Recycling old data files from legacy tools.

Challenge. Transfer data from SAS, SPSS, and MATLAB into an Amazon S3 data lake, passing through KNIME.

Access Mode. Access nodes for SAS, SPSS, and MATLAB files and a connection node for Amazon S3.

Will They Blend? Experiments in Data & Tool Blending. Today: XML meets JSON

In this blog series we’ll be experimenting with the most interesting blends of data and tools. Whether it’s mixing traditional sources with modern data lakes, open-source devops on the cloud with protected internal legacy tools, SQL with noSQL, web-wisdom-of-the-crowd with in-house handwritten notes, or IoT sensor data with idle chatting, we’re curious to find out: will they blend? Want to find out what happens when IBM Watson meets Google News, Hadoop Hive meets Excel, R meets Python, or MS Word meets MongoDB?

Follow us here and send us your ideas for the next data blending challenge you’d like to see at willtheyblend@knime.com.

Today: XML meets JSON

The Challenge

Do you remember the first post of this series? Yes, the one about blending news headlines about Barack Obama from the IBM Watson News and Google News services? Blending those headlines involved a little side blending: JSON-structured data (the response from Google News) with XML-structured data (the response from IBM Watson News).

Today, the challenge is to parse and blend XML-structured data with JSON-structured data. Recycling part of the original blog post workflow, we query the IBM Watson AlchemyAPI News service for the first 100 news headlines on Barack Obama and the first 100 news headlines on Michelle Obama. We want the response for Barack Obama to be received in XML format and the response for Michelle Obama in JSON format. Two data sets: one for Barack Obama (XML) and one for Michelle Obama (JSON). Will they blend?
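Conceptually, the blend boils down to unpacking two differently structured responses into one common table. The tiny Python sketch below illustrates the idea; the element and field names are invented for illustration, since the real AlchemyAPI responses have their own structure, and the workflow itself does the unpacking with KNIME’s XML and JSON parsing facilities:

    import json
    import xml.etree.ElementTree as ET

    # The element and field names below are invented for illustration - the real
    # AlchemyAPI responses have their own structure.
    barack_xml = """<results>
      <doc><title>Headline about Barack</title><date>2016-10-01</date></doc>
    </results>"""
    michelle_json = '{"docs": [{"title": "Headline about Michelle", "date": "2016-10-02"}]}'

    rows = []
    for doc in ET.fromstring(barack_xml).findall("doc"):
        rows.append({"who": "Barack", "title": doc.findtext("title"), "date": doc.findtext("date")})
    for doc in json.loads(michelle_json)["docs"]:
        rows.append({"who": "Michelle", "title": doc["title"], "date": doc["date"]})

    print(rows)   # one blended list of headlines, regardless of the original format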

Topic. News Headlines on Barack Obama and separately on Michelle Obama from October 2016.

Challenge. Blend the two response data sets, in XML and JSON format respectively.

Access Mode. JSON and XML parsing for IBM Watson AlchemyAPI News REST service response.

Using custom data types with the Java Snippet node

One of the cool new features of KNIME Analytics Platform 3.3 is the ability to use the Java Snippet node with objects and functions that are defined in KNIME extensions. This opens up an interesting and powerful new way to work with extensions that are compatible with the new functionality (more on the small amount of work required for that in a separate post, but for those who want to get a head start, here’s a link to the commit that added the functionality to the RDKit nodes). This post will demonstrate how to use the RDKit Java wrappers from within a KNIME Java Snippet node. Though this blog post uses cheminformatics functionality from the RDKit as a demonstration, the new feature can also be used to work with things like images, XML and JSON documents, and SVGs.

We’ll be working with the Java Snippet node; here’s some more information about that. Needless to say, you need to have KNIME 3.3 (or later) with the RDKit community nodes installed. As of this writing you need the nightly build of the RDKit nodes; the update site (http://update.knime.org/community-contributions/trunk) is linked from the KNIME Community site.

Example 1: Working with RDKit Molecules.

Let’s start by reading in a set of SMILES from a file, converting those to RDKit molecules, and then adding a Java Snippet node. Here’s the fragment of the workflow:
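For orientation, roughly the same logic (parse molecules, then compute something on them) looks like this in RDKit’s Python bindings; the Java wrappers used inside the Java Snippet node expose analogous functions, and the SMILES strings below are just stand-ins for the file being read:

    from rdkit import Chem

    # Two SMILES strings standing in for the file that the workflow reads.
    smiles_list = ["c1ccccc1", "CC(=O)Oc1ccccc1C(=O)O"]

    for smi in smiles_list:
        mol = Chem.MolFromSmiles(smi)        # parse the SMILES into an RDKit molecule
        if mol is None:
            continue                         # skip anything that fails to parse
        print(smi, mol.GetNumAtoms(), Chem.MolToSmiles(mol))   # atom count + canonical SMILES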

Will They Blend? Experiments in Data & Tool Blending. Today: MS Access meets H2. Test your Baseball Knowledge

In this blog series we’ll be experimenting with the most interesting blends of data and tools. Whether it’s mixing traditional sources with modern data lakes, open-source devops on the cloud with protected internal legacy tools, SQL with noSQL, web-wisdom-of-the-crowd with in-house handwritten notes, or IoT sensor data with idle chatting, we’re curious to find out: will they blend? Want to find out what happens when IBM Watson meets Google News, Hadoop Hive meets Excel, R meets Python, or MS Word meets MongoDB?

Follow us here and send us your ideas for the next data blending challenge you’d like to see at willtheyblend@knime.com.

Today: MS Access meets H2. Test your Baseball Knowledge

The Challenge

Today’s challenge is sport-related. How well do you know Major League Baseball? Do you know who had the best pitching and batting record between 1985 and 1990? Do you know who has been the highest-paid baseball player of all time?

Baseball has long been, and arguably still is, the most data-focused sport. The most famous documentation of data analytics in Major League Baseball is, of course, the Moneyball book (and subsequent movie), but data analysis in baseball has a much longer history.

For this blending challenge, we used batting and pitching statistics for all active players from 1985 to 2015 in the National League and the American League. This data has been made publicly available through Sean Lahman’s website. (A loud “thank you” to all of the site contributors for making this standard baseball encyclopedia publicly available.) The Lahman Database stores player statistics as well as data about managers, birthdates, awards, all-star games, and much more.

Now, in most companies each department owns specific data, sometimes even keeping it in separate databases. For instance, salaries and demographic data are often owned by HR, while performance metrics are owned by Operations. In this experiment, we assume the HR Department hosts salaries and demographics data in an MS Access database and the Operations Department stores the performance metrics (batting and pitching stats, among others) inside an H2 database.

MS Access is part of the Microsoft Office package and therefore available on most Windows-based PCs. H2 is a relatively new open source database, downloadable at http://www.h2database.com/html/download.html.

Today’s technical challenge is to attempt to blend data from an MS Access database and an H2 database. Will they blend?
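Once each connector has pulled its table out of its respective database, the blend itself is nothing more than a join on player and season. Here is a toy sketch of that final step; the column names follow the usual Lahman convention (playerID, yearID) but are an assumption, and the values are invented:

    import pandas as pd

    # Toy stand-ins for the two tables: salaries from the HR side (MS Access) and
    # batting stats from the Operations side (H2). Column names follow the usual
    # Lahman convention (playerID, yearID) but are an assumption here.
    salaries = pd.DataFrame({"playerID": ["smithjo01", "jonesbo02"],
                             "yearID":   [1990, 1990],
                             "salary":   [1200000, 850000]})
    batting  = pd.DataFrame({"playerID": ["smithjo01", "jonesbo02"],
                             "yearID":   [1990, 1990],
                             "HR":       [31, 12]})

    # The actual "blend": one join on player and season, regardless of where each table lives.
    blended = salaries.merge(batting, on=["playerID", "yearID"])
    print(blended)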

Afterwards, on the blended data, we will give a short guided tour of the KNIME WebPortal, to detect the best-paid and/or best-performing baseball players for each decade.

Topic. Best paid and best performing baseball players from 1985 to 2015.

Challenge. Blend data from MS Access and H2 databases and guide users through the analytics in a web browser.

Access Mode. Database Connectors and KNIME WebPortal.

Will They Blend? Experiments in Data & Tool Blending. Today: Amazon S3 meets MS Azure Blob Storage. A match made in the clouds

In this blog series we’ll be experimenting with the most interesting blends of data and tools. Whether it’s mixing traditional sources with modern data lakes, open-source devops on the cloud with protected internal legacy tools, SQL with noSQL, web-wisdom-of-the-crowd with in-house handwritten notes, or IoT sensor data with idle chatting, we’re curious to find out: will they blend? Want to find out what happens when IBM Watson meets Google News, Hadoop Hive meets Excel, R meets Python, or MS Word meets MongoDB?

Follow us here and send us your ideas for the next data blending challenge you’d like to see at willtheyblend@knime.com.

Today: Amazon S3 meets MS Azure Blob Storage. A match made in the clouds

The Challenge

Today let’s leave the ground and move into the clouds! When we talk about clouds, two options come immediately to mind: the Amazon cloud and the MS Azure cloud. Both clouds offer proprietary bulk repositories (data lakes) to store data: the Amazon cloud offers the S3 data repository and MS Azure offers the Blob Storage data repository.

Let’s suppose now that, by some unpredictable twist of fate, we have ended up with data on both clouds. How could we make the two clouds communicate, so as to collect all the data in a single place? It is a well-known fact that clouds rarely talk to each other.

Today’s challenge is to force the Amazon cloud and the MS Azure cloud to communicate and exchange data. That is, we want to blend data stored in an S3 data lake on the Amazon cloud with data stored in a Blob Storage data lake on the MS Azure cloud. Will they blend?

In the process, we will also put a few Excel files into the blender, just to keep our feet on the ground: after all, every data analytics process has to deal with an Excel file or two.
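For readers who prefer to see the two connections spelled out, here is a minimal Python sketch of pulling one file from each cloud and landing them side by side. Bucket, container, and connection string are placeholders; in the experiment itself the same job is done by KNIME’s connection nodes for Amazon S3 and MS Blob Storage:

    import boto3
    import pandas as pd
    from azure.storage.blob import BlobServiceClient

    # Bucket, container, and connection string are placeholders; the file names are the
    # CENSUS files mentioned above.
    boto3.client("s3").download_file("my-census-bucket", "ss13hme.csv", "ss13hme.csv")

    blob_service = BlobServiceClient.from_connection_string("<azure-storage-connection-string>")
    blob = blob_service.get_blob_client(container="census", blob="ss13pme.csv")
    with open("ss13pme.csv", "wb") as f:
        f.write(blob.download_blob().readall())

    homes  = pd.read_csv("ss13hme.csv")
    people = pd.read_csv("ss13pme.csv")
    print(len(homes), len(people))   # both data lakes, now side by side on one machine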

Topic. Analyze the commuting time of Maine workers from the new CENSUS file.

Challenge. Blend the CENSUS file ss13hme.csv, about homes in Maine and hosted on S3 on the Amazon cloud, with the file ss13pme.csv, about people in Maine and hosted on Blob Storage on the MS Azure cloud.

Access Mode. Connection to Amazon S3 and connection to MS Blob Storage.

Will They Blend? Experiments in Data & Tool Blending. Today: Local vs. remote files. Will blending overcome the distance?

In this blog series we’ll be experimenting with the most interesting blends of data and tools. Whether it’s mixing traditional sources with modern data lakes, open-source devops on the cloud with protected internal legacy tools, SQL with noSQL, web-wisdom-of-the-crowd with in-house handwritten notes, or IoT sensor data with idle chatting, we’re curious to find out: will they blend? Want to find out what happens when IBM Watson meets Google News, Hadoop Hive meets Excel, R meets Python, or MS Word meets MongoDB?

Follow us here and send us your ideas for the next data blending challenge you’d like to see at willtheyblend@knime.com.

Today: Local vs. remote files. Will blending overcome the distance?

The Challenge

Today’s challenge is distance: physical, geographical distance … between people and between compressed files.

Distance between people can easily be solved by any type of transportation. A flight before Christmas can take you back to your family just in time for the celebrations. What happens though if the flight is late? Better choose your airline carrier carefully to avoid undesired delays!

Distance between compressed files can easily be solved by KNIME. A few appropriate nodes can establish the right HTTP connection, download the remote file, and bring it home to join its local counterpart.

The goal is to visualize the ratio of departure delays at Chicago airport, by carrier, in a classic bar chart. We take the data from the airline dataset and focus on two years only: 2007 and 2008. I worked on this dataset for another project and already have the data for 2008 zipped and stored locally on my laptop. I am missing the data for 2007, but I can get it via the URL of the original website.

So on the one hand I have a ZIP file with the 2008 data from the airline data set here on my laptop. And on the other side I have a link to a ZIP file with the 2007 data on some server in some remote location, possibly close to the North Pole. Will KNIME bridge the distance? Will they blend?
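As a minimal sketch of the same plan outside of KNIME: download the remote ZIP, open the local one, and compute the share of delayed departures per carrier. The URL and the column names (UniqueCarrier, Origin, DepDelay) are assumptions based on the usual layout of the airline data set, and in the blog the downloading, unzipping, and charting are all done with KNIME nodes:

    import io
    import urllib.request
    import zipfile
    import pandas as pd

    # The URL and column names are assumptions based on the usual layout of the data set.
    remote_bytes = urllib.request.urlopen("http://example.com/airline/2007.zip").read()

    frames = []
    for archive in (zipfile.ZipFile(io.BytesIO(remote_bytes)), zipfile.ZipFile("2008.zip")):
        with archive as z:
            first_csv = z.namelist()[0]
            frames.append(pd.read_csv(z.open(first_csv),
                                      usecols=["UniqueCarrier", "Origin", "DepDelay"]))

    flights = pd.concat(frames)
    chicago = flights[flights["Origin"] == "ORD"]   # Chicago O'Hare
    delayed_ratio = (chicago["DepDelay"] > 0).groupby(chicago["UniqueCarrier"]).mean()
    print(delayed_ratio.sort_values(ascending=False))   # share of delayed departures per carrier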

Topic. Departure delays by carrier.

Challenge. Collect airline data for 2007 and 2008 and display departure delay ratio by carrier from Chicago airport.

Access Mode. One file is accessed locally and one file is accessed remotely via an HTTP connection.

Integrated Tool. JavaScript-based visualization.

Optimizing KNIME workflows for performance

KNIME® provides performance extensions such as the KNIME Big Data Connectors for executing Hive queries on Hadoop, or the KNIME Spark Executor for training models on Hadoop using Spark. But sometimes it doesn’t make sense to run your analytics on a Big Data cluster. Recently we released the KNIME Cloud Analytics Platform for Azure, which allows you to execute your KNIME workflows on demand on an Azure VM with up to 448 GB of RAM and 32 cores, which is one easy way to boost the performance of some of your workflows.

However, there are still a few extra tips and tricks that you can use to speed up execution of your workflows.

KNIME Analytics Platform follows a graphical programming paradigm, which, in addition to clearly expressing the steps taken to process your data, also allows rapid prototyping of ideas. The benefit of rapid prototyping is that you can quickly test an idea and prove the business value of that idea practically. Once you’ve shown that business value, you may deploy the workflow to run regularly on KNIME Server. Below I’ve highlighted some of the tips and tricks I’ve learned from KNIMErs that might help speed up the execution of some of your workflows. Read on to learn more.

Speaking Kerberos with KNIME Big Data Extensions

The ever-increasing adoption of big data technology brings with it the need to secure access to the data in your Hadoop cluster. In Hadoop’s early days, security offered protection only against accidental misuse, as not even user credentials were properly enforced. Before Hadoop could become a true multi-user platform, where organizations can put all of their data and host many different applications, the security issue obviously had to be tackled. The Hadoop community addressed this problem by adopting the venerable Kerberos protocol. Kerberos – whose current version dates back to 1993 – is an authentication protocol for distributed applications, and support for it is also integrated into KNIME Big Data Extensions.

Using Kerberos is often a bit of a hassle. But don’t despair: in this blog post we will show you, step by step, how to properly configure a Windows client with KNIME Big Data Connectors, so that you can connect to Hive, Impala, and HDFS in a Kerberos-secured cluster.
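As a small taste of what that configuration involves: a Kerberos client needs a krb5 configuration file (krb5.ini on Windows) that tells it which realm and KDC to talk to. This is only a sketch; the realm and host names below are placeholders, and your cluster administrator provides the real values.

    # Placeholder realm and KDC host - replace with the values for your cluster.
    [libdefaults]
        default_realm = MYCOMPANY.COM

    [realms]
        MYCOMPANY.COM = {
            kdc = kdc.mycompany.com
            admin_server = kdc.mycompany.com
        }

    [domain_realm]
        .mycompany.com = MYCOMPANY.COM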

By the way, KNIME Spark Executor also supports Kerberos. Detailed configuration steps can be found in the Installation Guide PDF on the product page.

If any of the Kerberos terminology is unclear, you can refer to the glossary at the bottom of this blog post.