KNIME news, usage, and development

Will They Blend? Experiments in Data & Tool Blending. Today: SAS, SPSS, and MATLAB meet Amazon S3: setting the past free

In this blog series we’ll be experimenting with the most interesting blends of data and tools. Whether it’s mixing traditional sources with modern data lakes, open-source devops on the cloud with protected internal legacy tools, SQL with noSQL, web-wisdom-of-the-crowd with in-house handwritten notes, or IoT sensor data with idle chatting, we’re curious to find out: will they blend? Want to find out what happens when IBM Watson meets Google News, Hadoop Hive meets Excel, R meets Python, or MS Word meets MongoDB?

Follow us here and send us your ideas for the next data blending challenge you’d like to see at willtheyblend@knime.com.

Today: SAS, SPSS, and MATLAB meet Amazon S3: setting the past free

The Challenge

I am an old guy. And old guys grew up with proprietary data formats for doing cool data science things. That means I have literally hundreds of SAS, SPSS, and MATLAB files on my hard disk, and I would still love to use that data - but I no longer have the cash for the yearly licenses of those venerable old packages.

I know I can read all those files for free with KNIME. But what I REALLY want to do is move them into an open format and blend them somewhere modern like Amazon S3 or Microsoft Azure Blob Storage. But when I check out the various forums, like the SAS one, I find only horrendously complicated methods that - oh by the way - still require an expensive license for the appropriate legacy tool. So it’s KNIME to the rescue. To make it easy, this example pulls in files of all three legacy types and lets you either move them as they are or first convert them to an open format before moving them to - in this example - Amazon S3.

It’s easy, it’s open, it’s beautiful… and it’s free.

Topic. Recycling old data files from legacy tools.

Challenge. Transfer data from SAS, SPSS, and MATLAB into an Amazon S3 data lake, passing through KNIME.

Access Mode. Reader nodes for SAS, SPSS, and MATLAB files, plus a connection node to Amazon S3.
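For readers who like to see the idea spelled out in code, here is a minimal Python sketch of the same concept outside of KNIME: read a legacy file, convert it to an open format, and upload it to S3. The file names and the bucket name are hypothetical placeholders, and this is only an illustration of the idea - the point of the workflow above is that KNIME does all of this without writing any code.

```python
# Illustrative sketch only (not the KNIME workflow): read a legacy SAS file,
# convert it to an open format (CSV), and upload it to Amazon S3.
# File names and the bucket name below are hypothetical placeholders.
import pandas as pd
import boto3

# pandas can read SAS files (sas7bdat / xport) without a SAS license
df = pd.read_sas("old_analysis.sas7bdat")

# convert to an open, tool-independent format
df.to_csv("old_analysis.csv", index=False)

# upload the converted file to an S3 bucket
# (credentials come from the usual AWS configuration, e.g. ~/.aws/credentials)
s3 = boto3.client("s3")
s3.upload_file("old_analysis.csv", "my-data-lake-bucket", "legacy/old_analysis.csv")
```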

Will They Blend? Experiments in Data & Tool Blending. Today: XML meets JSON

Today: XML meets JSON

The Challenge

Do you remember the first post of this series? Yes, the one about blending news headlines about Barack Obama from the IBM Watson News and Google News services? Blending the news headlines involved a little side blending: JSON-structured data (the response from Google News) with XML-structured data (the response from IBM Watson News).

Today, the challenge is to parse and blend XML-structured data with JSON-structured data. Recycling part of the original blog post workflow, we query the IBM Watson AlchemyAPI News service for the first 100 news headlines on Barack Obama and the first 100 news headlines on Michelle Obama. We want the response for Barack Obama to be delivered in XML format and the response for Michelle Obama in JSON format. Two data sets: one for Barack Obama (XML) and one for Michelle Obama (JSON). Will they blend?

Topic. News Headlines on Barack Obama and separately on Michelle Obama from October 2016.

Challenge. Blend the two response data sets, in XML and JSON format respectively.

Access Mode. JSON and XML parsing of the IBM Watson AlchemyAPI News REST service responses.
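To make the blending concrete outside of KNIME, here is a minimal Python sketch of what parsing the two response formats into a common structure amounts to. The JSON and XML payloads below are simplified, hypothetical stand-ins for the actual AlchemyAPI responses, not the real payloads.

```python
# Illustrative sketch: parse a JSON response and an XML response into the
# same flat structure, then "blend" them into one list of records.
# The sample payloads are simplified, hypothetical stand-ins.
import json
import xml.etree.ElementTree as ET

json_response = '{"docs": [{"title": "Michelle Obama speaks at rally"}]}'
xml_response = "<result><docs><doc><title>Barack Obama meets EU leaders</title></doc></docs></result>"

# JSON side: one record per document
json_records = [{"title": d["title"], "person": "Michelle Obama"}
                for d in json.loads(json_response)["docs"]]

# XML side: same record layout
root = ET.fromstring(xml_response)
xml_records = [{"title": t.text, "person": "Barack Obama"}
               for t in root.findall(".//title")]

# the "blend": once both sides share the same structure, concatenation is trivial
blended = xml_records + json_records
for record in blended:
    print(record)
```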

Using custom data types with the Java Snippet node

One of the cool new features of KNIME Analytics Platform 3.3 is the ability to use the Java Snippet node with objects and functions that are defined in KNIME extensions. This allows an interesting and powerful new way to work with extensions that are compatible with the new functionality (more on the small amount of work required for that in a separate post, but for those who want to get a head start, here’s a link to the commit that added the functionality to the RDKit nodes). This post will demonstrate how to use the RDKit Java wrappers from within a KNIME Java Snippet node. Though this blog post uses cheminformatics functionality from the RDKit as a demonstration, the new feature can also be used to work with things like images, XML and JSON documents, and SVGs.

We’ll be working with the Java Snippet node; here’s some more information about that. Needless to say, you need to have KNIME 3.3 (or later) with the RDKit community nodes installed. As of this writing, you need the nightly build of the RDKit nodes; the update site (http://update.knime.org/community-contributions/trunk) is linked from the KNIME Community site.

Example 1: Working with RDKit Molecules

Let’s start by reading in a set of SMILES from a file, converting those to RDKit molecules, and then adding a Java Snippet node. Here’s the fragment of the workflow:

Will They Blend? Experiments in Data & Tool Blending. Today: MS Access meets H2. Test your Baseball Knowledge

Today: MS Access meets H2. Test your Baseball Knowledge

The Challenge

Today’s challenge is sport-related. How well do you know Major League Baseball? Do you know who had the best pitching and batting record between 1985 and 1990? Do you know who has been the highest-paid baseball player of all time?

For a long time, baseball has been, and arguably still is, the most data-focused sport. The most famous documentation of data analytics in Major League Baseball is, of course, the Moneyball book (and subsequent movie), but data analysis in baseball has a much longer history.

For this blending challenge, we used batting and pitching statistics for all active players from 1985 to 2015 in the National League and the American League. This data has been made publicly available through Sean Lahman’s website. (A loud “thank you” to all of the site contributors for making this standard baseball encyclopedia publicly available.) The Lahman Database stores player statistics as well as data about managers, birthdates, awards, all-star games, and much more.

Now, in most companies each department owns specific data, sometimes even kept in separate databases. For instance, salaries and demographic data are often owned by HR, while performance metrics are owned by Operations. In this experiment, we assume the HR Department hosts salary and demographic data in an MS Access database and the Operations Department stores the performance metrics (batting and pitching stats, among others) inside an H2 database.

MS Access is part of the Microsoft Office package and therefore available on most Windows-based PCs. H2 is a relatively new open-source database, downloadable at http://www.h2database.com/html/download.html.

Today’s technical challenge is to attempt to blend data from an MS Access database and an H2 database. Will they blend?

Afterwards, on the blended data, we will give a short guided tour of the KNIME WebPortal to identify the best-paid and/or best-performing baseball players for each decade.

Topic. Best paid and best performing baseball players from 1985 to 2015.

Challenge. Blend data from MS Access and H2 databases and guide users through the analytics in a web browser.

Access Mode. Database Connectors and KNIME WebPortal.

Will They Blend? Experiments in Data & Tool Blending. Today: Amazon S3 meets MS Azure Blob Storage. A match made in the clouds

Today: Amazon S3 meets MS Azure Blob Storage. A match made in the clouds

The Challenge

Today let’s leave the ground and move into the clouds! When we talk about clouds, two options come immediately to mind: the Amazon cloud and the MS Azure cloud. Both clouds offer proprietary bulk repositories (data lakes) to store data: the Amazon cloud offers the S3 data repository, while MS Azure offers the Blob Storage data repository.

Let’s suppose now that, by some unpredictable twist of fate, we have ended up with data on both clouds. How could we make the two clouds communicate, so as to collect all the data in a single place? It is a well-known fact that clouds rarely talk to each other.

Today’s challenge is to force the Amazon cloud and the MS Azure cloud to communicate and exchange data. That is, we want to blend data stored in an S3 data lake on the Amazon cloud with data stored in a Blob Storage data lake on the MS Azure cloud. Will they blend?

In the process, we will also put a few Excel files into the blender, just to keep our feet on the ground: after all, every data analytics process has to deal with an Excel file or two.

Topic. Analyze the commuting time of Maine workers from the new CENSUS file.

Challenge. Blend the CENSUS file ss13hme.csv, about homes in Maine and hosted on S3 in the Amazon cloud, with the file ss13pme.csv, about people in Maine and hosted on Blob Storage in the MS Azure cloud.

Access Mode. Connection to Amazon S3 and connection to MS Azure Blob Storage.
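As a rough illustration of what the two connections boil down to outside of KNIME, here is a minimal Python sketch that pulls one file from each cloud and joins them. The bucket name, container name, and connection string are hypothetical placeholders, and the join on the shared SERIALNO column is an assumption about how the two census files line up.

```python
# Illustrative sketch: fetch ss13hme.csv from Amazon S3 and ss13pme.csv from
# Azure Blob Storage, then blend them locally. Bucket/container names and the
# Azure connection string are hypothetical placeholders.
import boto3
import pandas as pd
from azure.storage.blob import BlobServiceClient

# Amazon side: download the homes file from an S3 bucket
s3 = boto3.client("s3")
s3.download_file("my-census-bucket", "ss13hme.csv", "ss13hme.csv")

# Azure side: download the people file from a Blob Storage container
blob_service = BlobServiceClient.from_connection_string("<azure-connection-string>")
blob = blob_service.get_blob_client(container="census", blob="ss13pme.csv")
with open("ss13pme.csv", "wb") as f:
    f.write(blob.download_blob().readall())

# the blend: join homes and people records (assuming the shared SERIALNO column)
homes = pd.read_csv("ss13hme.csv")
people = pd.read_csv("ss13pme.csv")
blended = homes.merge(people, on="SERIALNO", how="inner")
print(blended.shape)
```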

Will They Blend? Experiments in Data & Tool Blending. Today: Local vs. remote files. Will blending overcome the distance?

Today: Local vs. remote files. Will blending overcome the distance?

The Challenge

Today’s challenge is distance: physical, geographical distance … between people and between compressed files.

Distance between people can easily be solved by any type of transportation. A flight before Christmas can take you back to your family just in time for the celebrations. What happens though if the flight is late? Better choose your airline carrier carefully to avoid undesired delays!

Distance between compressed files can easily be solved by KNIME. A few appropriate nodes can establish the right HTTP connection, download the remote file, and bring it home to join the local files.

The goal is to visualize the ratio of departure delays at Chicago airport by carrier with a classic bar chart. We will take the data from the airline data set and focus on two years only: 2007 and 2008. I worked on this data set for another project, and I already have the data for 2008 zipped and stored locally on my laptop. I am missing the data for 2007, but I can get it via the URL of the original website.

So on the one hand, I have a ZIP file with the 2008 data from the airline data set here on my laptop. And on the other hand, I have a link to a ZIP file with the 2007 data on some server in some remote location, possibly close to the North Pole. Will KNIME bridge the distance? Will they blend?

Topic. Departure delays by carrier.

Challenge. Collect airline data for 2007 and 2008 and display the departure delay ratio by carrier for Chicago airport.

Access Mode. One file is accessed locally and one file is accessed remotely via an HTTP connection.

Integrated Tool. JavaScript-based visualization.
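For a sense of what those few nodes are doing under the hood, here is a minimal Python sketch of the same steps: fetch the remote ZIP over HTTP, read both years, and compute the departure delay ratio by carrier. The local path and download URL are hypothetical placeholders; the column names (UniqueCarrier, DepDelay, Origin) follow the well-known airline data set, and Chicago O’Hare (ORD) is used as the Chicago airport.

```python
# Illustrative sketch: one ZIP is already local (2008), the other is fetched
# over HTTP (2007). The local path and the URL are hypothetical placeholders.
import io
import urllib.request
import zipfile
import pandas as pd

def read_airline_zip(zip_source, csv_name):
    """Read one year's CSV out of a ZIP archive into a DataFrame."""
    with zipfile.ZipFile(zip_source) as zf:
        with zf.open(csv_name) as f:
            return pd.read_csv(f, usecols=["UniqueCarrier", "DepDelay", "Origin"])

# local side: the 2008 data already sits on my laptop
local_2008 = read_airline_zip("2008.zip", "2008.csv")

# remote side: download the 2007 data via HTTP
remote_bytes = urllib.request.urlopen("http://example.com/airline/2007.zip").read()
remote_2007 = read_airline_zip(io.BytesIO(remote_bytes), "2007.csv")

# blend the two years and compute the share of delayed departures per carrier at ORD
flights = pd.concat([local_2008, remote_2007])
ord_flights = flights[flights["Origin"] == "ORD"]
delay_ratio = (ord_flights["DepDelay"] > 0).groupby(ord_flights["UniqueCarrier"]).mean()
print(delay_ratio.sort_values(ascending=False))
```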

Optimizing KNIME workflows for performance

KNIME® provides performance extensions such as the KNIME Big Data Connectors for executing Hive queries on Hadoop, or the KNIME Spark Executor for training models on Hadoop using Spark. But sometimes it doesn’t make sense to run your analytics on a Big Data cluster. Recently we released the KNIME Cloud Analytics Platform for Azure, which allows you to execute your KNIME workflows on demand on an Azure VM with up to 448 GB of RAM and 32 cores; that alone is one easy way to boost the performance of some of your workflows.

However, there are still a few extra tips and tricks that you can use to speed up execution of your workflows.

KNIME Analytics Platform follows a graphical programming paradigm which, in addition to clearly expressing the steps taken to process your data, also allows rapid prototyping of ideas. The benefit of rapid prototyping is that you can quickly test an idea and prove its business value in practice. Once you’ve shown that business value, you may deploy the workflow to run regularly on KNIME Server. Below I’ve highlighted some of the tips and tricks I’ve learned from KNIMErs that might help speed up the execution of some of your workflows. Read on to learn more.

Speaking Kerberos with KNIME Big Data Extensions

The ever-increasing application of big data technology brings with it the need to secure access to the data in your Hadoop cluster. In Hadoop’s early days, security offered protection only against accidental misuse, as not even user credentials were properly enforced. Before Hadoop could become a true multi-user platform, where organizations can put all of their data and host many different applications, the security issue obviously had to be tackled. The Hadoop community addressed this problem by adopting the venerable Kerberos protocol. Kerberos, which dates back to 1993, is an authentication protocol for distributed applications, and support for it is also integrated into the KNIME Big Data Extensions.

Using Kerberos is often a bit of a hassle. But don’t despair: in this blog post we will show you, step by step, how to properly configure a Windows client with the KNIME Big Data Connectors so that you can connect to Hive, Impala, and HDFS in a Kerberos-secured cluster.

By the way, KNIME Spark Executor also supports Kerberos. Detailed configuration steps can be found in the Installation Guide PDF on the product page.

If any of the Kerberos terminology is unclear, you can refer to the glossary at the bottom of this blog post.

Will They Blend? Experiments in Data & Tool Blending. Today: MS Word meets Web Crawling. Identifying the Secret Ingredient

Today: MS Word meets Web Crawling. Identifying the Secret Ingredient

Authors: Roland Burger and Heather Fyson

The Challenge

It’s Christmas again, and like every Christmas, I am ready to bake my famous Christmas cookies. They’re great and have been praised by many! My recipes are well-kept secrets as I am painfully aware of the competition. There are recipes galore on the web nowadays. I need to find out more about these internet-hosted recipes and their ingredients, particularly any ingredient I might have overlooked over the years.

My recipes are stored securely in an MS Word document on my computer, while a very well-regarded website for cookie recipes is http://www.handletheheat.com/peanut-butter-snickerdoodles/. I need to compare my own recipe with this one on the web and find out what makes them different. What is the secret ingredient for the ultimate Christmas cookie?

In practice, on the one hand I want to read and parse my Word recipe texts, and on the other hand I want to use web crawling to read and parse this web-hosted recipe. The question, as usual, is: will they blend?

Topic. Christmas cookies.

Challenge. Identify the secret ingredients in cookie recipes from the web by comparing them to my own recipes, stored locally in an MS Word document.

Access Mode. An MS Word (.docx) file reader and parser, plus web crawler nodes.
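Outside of KNIME, the same two-sided parsing can be sketched in a few lines of Python. The document name and the very naive word-level comparison below are hypothetical simplifications, just to make the idea tangible; the actual workflow uses KNIME reader and crawler nodes instead.

```python
# Illustrative sketch: parse a local .docx recipe and a web-hosted recipe,
# then look for words that appear only on the web side. The file name and
# the crude tokenization are hypothetical simplifications.
import requests
from bs4 import BeautifulSoup
from docx import Document

# local side: pull the text out of my Word recipe document
doc = Document("my_secret_recipes.docx")
my_text = " ".join(p.text for p in doc.paragraphs).lower()

# web side: crawl the recipe page and strip the HTML
html = requests.get("http://www.handletheheat.com/peanut-butter-snickerdoodles/").text
web_text = BeautifulSoup(html, "html.parser").get_text(separator=" ").lower()

# the blend: which words show up in the web recipe but not in mine?
my_words = set(my_text.split())
web_words = set(web_text.split())
candidate_secret_ingredients = web_words - my_words
print(sorted(candidate_secret_ingredients)[:50])
```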

Will They Blend? Experiments in Data & Tool Blending. Today: A Cross-Platform Ensemble Model: R meets Python and KNIME. Embrace Freedom in the Data Science Lab

Today: A Cross-Platform Ensemble Model: R meets Python and KNIME. Embrace Freedom in the Data Science Lab

The Challenge

Today’s challenge consists of building a cross-platform ensemble model. The ensemble must combine a Support Vector Machine (SVM), a logistic regression, and a decision tree. Let’s raise the bar even more and train these models on different analytics platforms: R, Python, and of course KNIME. (Note that we could, of course, create all of those models in KNIME, but that would kill the rest of the story...)

A small group of three data scientists was given the task of predicting flight departure delays from Chicago O’Hare (ORD) airport, based on the airline data set. As soon as the data came in, all three data scientists built a model in record time. I mean, each one of them built a different model on a different platform! We ended up with a Python script to build a logistic regression, an R script to build an SVM, and a KNIME workflow to train a decision tree. Which one should we choose?

We had two options here: select the best model and crown the champion, or embrace diversity and build an ensemble model. Since more is usually better than less, we opted for the ensemble model. Thus, we just needed to convince two out of the three data scientists to switch analytics platforms.

Or maybe not.

Thanks to its open architecture, KNIME can easily integrate R and Python scripts. In this way, every data scientist can use his/her preferred analytics platform, while KNIME collects and fuses the results.

Today’s challenge has three main characters: a decision tree built on KNIME Analytics Platform, an SVM built in R, and a logistic regression built with Python. Will they blend?

Topic. Flight departure delays from Chicago O’Hare (ORD) Airport.

Challenge. Build a cross-platform ensemble model by blending an R SVM, a Python logistic regression, and a KNIME decision tree.

KNIME Extensions. Python and R Integrations.
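To give a flavor of the Python contribution, here is a minimal sketch of the kind of script that could sit inside KNIME’s Python scripting node for this task. The feature and target columns are hypothetical stand-ins for whatever the original workflow used; input_table and output_table are the pandas DataFrames that the KNIME Python integration hands in and expects back.

```python
# Minimal sketch of a logistic regression as it might appear inside a KNIME
# Python scripting node. Feature/target columns are hypothetical; input_table
# and output_table are the node's pandas DataFrames.
from sklearn.linear_model import LogisticRegression

df = input_table.copy()

# hypothetical features and a binary target: was the departure delayed?
features = ["Month", "DayOfWeek", "Distance"]
X = df[features]
y = (df["DepDelay"] > 15).astype(int)

model = LogisticRegression(max_iter=1000)
model.fit(X, y)

# append the predicted delay probability so KNIME can blend it downstream
# with the R (SVM) and KNIME (decision tree) predictions
df["P(delayed)"] = model.predict_proba(X)[:, 1]
output_table = df
```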