Will They Blend? Experiments in Data & Tool Blending. Today: SAS, SPSS, and MATLAB meet Amazon S3: setting the past free

In this blog series we’ll be experimenting with the most interesting blends of data and tools. Whether it’s mixing traditional sources with modern data lakes, open-source devops on the cloud with protected internal legacy tools, SQL with NoSQL, web-wisdom-of-the-crowd with in-house handwritten notes, or IoT sensor data with idle chatting, we’re curious to find out: will they blend? Want to find out what happens when IBM Watson meets Google News, Hadoop Hive meets Excel, R meets Python, or MS Word meets MongoDB?

Follow us here and send us your ideas for the next data blending challenge you’d like to see at willtheyblend@knime.com.

The Challenge

I am an old guy. And old guys grew up with proprietary data formats for doing cool data science things. That means I have literally hundreds of SAS, SPSS, and MATLAB files on my hard disk, and I would still love to use that data - but I no longer have the cash for the yearly licenses of those venerable old packages.

I know I can read all those files for free with KNIME. But what I REALLY want to do is move them into an open format and blend them somewhere modern like Amazon S3 or Microsoft Azure Blob Storage. But when I check out the various forums, like the SAS one, I find only horrendously complicated methods that - oh by the way - still require an expensive license of the appropriate legacy tool. So it’s KNIME to the rescue: to make it easy, this example pulls in files of all three legacy types and lets you either move them as they are or first convert them to an open format before moving them to - in this example - Amazon S3.

It’s easy, it’s open, it’s beautiful… and it’s free.
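Outside of KNIME, the same round trip can be sketched in a few lines of Python. Everything below - the file names, the bucket name, the pandas/SciPy/boto3 stack - is an illustrative assumption rather than the workflow itself:

    import boto3
    import pandas as pd
    from scipy.io import loadmat

    # Read the legacy files: pandas reads SAS and SPSS natively (read_spss
    # needs the pyreadstat package), SciPy reads MATLAB .mat files.
    sas_df = pd.read_sas("old_study.sas7bdat")      # placeholder file names
    spss_df = pd.read_spss("old_survey.sav")
    mat = loadmat("old_signals.mat")                # dict: variable name -> array
    mat_df = pd.DataFrame(mat["measurements"])      # assumes a 2-D variable

    # Convert everything to an open format ...
    for name, df in [("sas", sas_df), ("spss", spss_df), ("matlab", mat_df)]:
        df.to_csv(name + ".csv", index=False)

    # ... and move the files to S3 (credentials come from the usual AWS config).
    s3 = boto3.client("s3")
    for name in ("sas", "spss", "matlab"):
        s3.upload_file(name + ".csv", "my-bucket", "legacy/" + name + ".csv")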

Topic. Recycling old data files from legacy tools.

Challenge. Transfer data from SAS, SPSS, and MATLAB into an Amazon S3 data lake, passing through KNIME.

Access Mode. Reader nodes for SAS, SPSS, and MATLAB files and a connection node to Amazon S3.

Will They Blend? Experiments in Data & Tool Blending. Today: XML meets JSON

The Challenge

Do you remember the first post of this series? Yes, the one about blending news headlines about Barack Obama from the IBM Watson News and Google News services? Blending those headlines already involved a little side blending: the JSON-structured data in the response from Google News had to be blended with the XML-structured data in the response from IBM Watson News.

Today, the challenge is to parse and blend XML-structured data with JSON-structured data. Recycling part of the original blog post’s workflow, we query the IBM Watson AlchemyAPI News service for the first 100 news headlines about Barack Obama and the first 100 news headlines about Michelle Obama. We request the response for Barack Obama in XML format and the response for Michelle Obama in JSON format. Two data sets: one for Barack Obama (XML) and one for Michelle Obama (JSON). Will they blend?
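To make the blending step concrete, here is a minimal Python sketch of the idea: parse both response formats into the same flat row structure, then concatenate. The response fragments below are made up; the real AlchemyAPI payloads are richer:

    import json
    import xml.etree.ElementTree as ET

    import pandas as pd

    # Made-up response fragments, standing in for the real API payloads.
    xml_response = """<result><docs>
      <doc><title>Barack Obama speaks in Chicago</title><source>paper A</source></doc>
    </docs></result>"""
    json_response = """{"result": {"docs": [
      {"title": "Michelle Obama visits a school", "source": "paper B"}]}}"""

    # Parse both formats into the same flat structure: one row per headline.
    xml_rows = [
        {"title": d.findtext("title"), "source": d.findtext("source"), "who": "Barack"}
        for d in ET.fromstring(xml_response).iter("doc")
    ]
    json_rows = [
        {"title": d["title"], "source": d["source"], "who": "Michelle"}
        for d in json.loads(json_response)["result"]["docs"]
    ]

    # Once both responses are plain rows, the blend is a simple concatenation.
    blended = pd.DataFrame(xml_rows + json_rows)
    print(blended)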

Topic. News Headlines on Barack Obama and separately on Michelle Obama from October 2016.

Challenge. Blend the two response data sets, in XML and JSON format respectively.

Access Mode. JSON and XML parsing for IBM Watson AlchemyAPI News REST service response.

Using custom data types with the Java Snippet node

One of the cool new features of KNIME Analytics Platform 3.3 is the ability to use the Java Snippet node with objects and functions that are defined in KNIME extensions. This enables an interesting and powerful new way of working with extensions that support the new functionality (more on the small amount of work required for that in a separate post, but for those who want to get a head start, here’s a link to the commit that added the functionality to the RDKit nodes). This post will demonstrate how to use the RDKit Java wrappers from within a KNIME Java Snippet node. Though this blog post uses cheminformatics functionality from the RDKit as a demonstration, the new feature can also be used to work with things like images, XML and JSON documents, and SVGs.

We’ll be working with the Java Snippet node. Needless to say, you need KNIME 3.3 (or later) with the RDKit community nodes installed. As of this writing you need the nightly build of the RDKit nodes; the update site (http://update.knime.org/community-contributions/trunk) is linked from the KNIME Community site.

Example 1: Working with RDKit Molecules.

Let’s start by reading in a set of SMILES from a file, converting those to RDKit molecules, and then adding a Java Snippet node.
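The snippet code itself works against the RDKit Java wrappers; for orientation, the same two steps look like this in RDKit’s Python bindings (the SMILES are made-up examples, not the post’s data):

    from rdkit import Chem

    # Step 1: parse SMILES into RDKit molecules (made-up molecules).
    smiles = ["c1ccccc1O", "CCO", "CC(=O)Nc1ccc(O)cc1"]
    mols = [Chem.MolFromSmiles(s) for s in smiles]

    # Step 2: compute per-molecule values, as the Java Snippet node would.
    for mol in mols:
        print(mol.GetNumAtoms(), Chem.MolToSmiles(mol))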

Will They Blend? Experiments in Data & Tool Blending. Today: MS Access meets H2. Test your Baseball Knowledge

The Challenge

Today’s challenge is sport-related. How well do you know Major League Baseball? Do you know who had the best pitching and batting records between 1985 and 1990? Do you know who has been the highest-paid baseball player of all time?

For a long time, baseball has been, and arguably still is, the most data-focused sport. The most famous documentation of data analytics in Major League Baseball is, of course, the book Moneyball (and the subsequent movie), but data analysis in baseball has a much longer history.

For this blending challenge, we used batting and pitching statistics for all active players from 1985 to 2015 in the National League and the American League. This data has been made publicly available through Sean Lahman’s website. (A loud “thank you” to all of the site contributors for making this standard baseball encyclopedia publicly available.) The Lahman Database stores player statistics as well as data about managers, birthdates, awards, all-star games, and much more.

Now, in most companies each department owns specific data, sometimes even keeping it in separate databases. For instance, salaries and demographic data are often owned by HR, while performance metrics are owned by Operations. In this experiment, we assume the HR department hosts salary and demographic data in an MS Access database and the Operations department stores the performance metrics (batting and pitching stats, among others) inside an H2 database.

MS Access is part of the Microsoft Office package and therefore available on most Windows-based PCs. H2 is a relatively new open source database, downloadable at http://www.h2database.com/html/download.html.

Today’s technical challenge is to attempt to blend data from an MS Access database and an H2 database. Will they blend?
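For readers without KNIME at hand, here is a rough Python stand-in for the two database connector nodes, assuming the UCanAccess and H2 JDBC drivers and the Lahman table layout; all paths, jar locations, and credentials are placeholders:

    import jaydebeapi
    import pandas as pd

    # HR data lives in MS Access (reachable via the UCanAccess JDBC driver),
    # operations data in H2; paths, jars, and credentials are placeholders.
    hr = jaydebeapi.connect(
        "net.ucanaccess.jdbc.UcanaccessDriver",
        "jdbc:ucanaccess:///data/hr/lahman.accdb",
        [], "/opt/jdbc/ucanaccess.jar")
    ops = jaydebeapi.connect(
        "org.h2.Driver", "jdbc:h2:/data/ops/lahman",
        ["sa", ""], "/opt/jdbc/h2.jar")

    def fetch(conn, sql):
        # Run a query and return the result set as a DataFrame.
        cur = conn.cursor()
        cur.execute(sql)
        cols = [c[0] for c in cur.description]
        rows = cur.fetchall()
        cur.close()
        return pd.DataFrame(rows, columns=cols)

    salaries = fetch(hr, "SELECT playerID, yearID, salary FROM Salaries")
    batting = fetch(ops, "SELECT playerID, yearID, HR FROM Batting")

    # The blend: join salary and performance on player and season.
    blended = salaries.merge(batting, on=["playerID", "yearID"])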

Afterwards, on the blended data, we will give a short guided tour of the KNIME WebPortal, to detect the best-paid and/or best-performing baseball players for each decade.

Topic. Best paid and best performing baseball players from 1985 to 2015.

Challenge. Blend data from MS Access and H2 databases and guide the user through the analytics in a web browser.

Access Mode. Database Connectors and KNIME WebPortal.

Will They Blend? Experiments in Data & Tool Blending. Today: Amazon S3 meets MS Azure Blob Storage. A match made in the clouds

The Challenge

Today let’s leave the ground and move into the clouds! When we talk about clouds, two options come immediately to mind: the Amazon cloud and the MS Azure cloud. Both offer proprietary bulk repositories (data lakes) to store data: the Amazon cloud offers the S3 data repository, and MS Azure offers the Blob Storage data repository.

Let’s suppose now that, by some unpredictable twist of fate, we have ended up with data on both clouds. How can we make the two clouds communicate, so as to collect all the data in a single place? It is a well-known fact that clouds rarely talk to each other.

Today’s challenge is to force the Amazon cloud and the MS Azure cloud to communicate and exchange data. That is, we want to blend data stored in an S3 data lake on the Amazon cloud with data stored in a Blob Storage data lake on the MS Azure cloud. Will they blend?

In the process, we will also put a few Excel files into the blender, just to keep our feet on the ground: after all, every data analytics process has to deal with an Excel file or two.
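As a point of reference, the same blend can be sketched in Python with the official client libraries; the bucket, the container, and the connection string are placeholders, and the join key SERIALNO is an assumption about the CENSUS file layout:

    import io

    import boto3
    import pandas as pd
    from azure.storage.blob import BlobServiceClient

    # ss13hme.csv from Amazon S3 (bucket name is a placeholder).
    s3 = boto3.client("s3")
    homes_bytes = s3.get_object(Bucket="census-data", Key="ss13hme.csv")["Body"].read()

    # ss13pme.csv from Azure Blob Storage (connection string is a placeholder).
    service = BlobServiceClient.from_connection_string("<azure-connection-string>")
    blob = service.get_blob_client(container="census", blob="ss13pme.csv")
    people_bytes = blob.download_blob().readall()

    homes = pd.read_csv(io.BytesIO(homes_bytes))
    people = pd.read_csv(io.BytesIO(people_bytes))

    # The blend: PUMS housing and person records share the SERIALNO key
    # (an assumption about the CENSUS file layout).
    blended = people.merge(homes, on="SERIALNO")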

Topic. Analyze the commuting time of Maine workers from the new CENSUS files.

Challenge. Blend the CENSUS file ss13hme.csv (homes in Maine), hosted on S3 in the Amazon cloud, with the file ss13pme.csv (people in Maine), hosted on Blob Storage in the MS Azure cloud.

Access Mode. Connection to Amazon S3 and connection to MS Blob Storage.

Will They Blend? Experiments in Data & Tool Blending. Today: Local vs. remote files. Will blending overcome the distance?

The Challenge

Today’s challenge is distance: physical, geographical distance … between people and between compressed files.

Distance between people can easily be solved by any type of transportation. A flight before Christmas can take you back to your family just in time for the celebrations. What happens though if the flight is late? Better choose your airline carrier carefully to avoid undesired delays!

Distance between compressed files can easily be solved by KNIME. A few appropriate nodes can establish the right HTTP connection, download the remote file, and bring it home to the local file system.

The goal is to visualize the ratio of departure delays at the Chicago airport by carrier in a classic bar chart. We will take the data from the airline data set, focusing on two years only: 2007 and 2008. I worked on this data set for another project, and I already have the data for the year 2008 zipped and stored locally on my laptop. I am missing the data for the year 2007, but I can get it via the URL of the original website.

So on the one hand I have a ZIP file with the 2008 data from the airline data set here on my laptop. On the other hand, I have a link to a ZIP file with the 2007 data on some server in some remote location, possibly close to the North Pole. Will KNIME bridge the distance? Will they blend?
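For the curious, here is roughly what the two access modes plus the delay computation look like in plain Python; the local path, the remote URL, and the 15-minute delay threshold are assumptions for illustration:

    import io
    import urllib.request
    import zipfile

    import pandas as pd

    def read_zipped_csv(source):
        # Read the first CSV found inside a ZIP archive into a DataFrame.
        with zipfile.ZipFile(source) as zf:
            with zf.open(zf.namelist()[0]) as f:
                return pd.read_csv(f)

    # 2008 data: the ZIP file already sitting on the local disk (placeholder path).
    df08 = read_zipped_csv("2008_airline.zip")

    # 2007 data: the ZIP file fetched over HTTP (placeholder URL).
    with urllib.request.urlopen("http://example.com/airline/2007_airline.zip") as resp:
        df07 = read_zipped_csv(io.BytesIO(resp.read()))

    # Departure delay ratio by carrier out of Chicago O'Hare (ORD);
    # column names follow the airline data set, >15 min counts as delayed.
    flights = pd.concat([df07, df08])
    ord_flights = flights[flights["Origin"] == "ORD"]
    delayed = ord_flights["DepDelay"] > 15
    ratio = delayed.groupby(ord_flights["UniqueCarrier"]).mean()
    ratio.plot.bar()  # requires matplotlib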

Topic. Departure delays by carrier.

Challenge. Collect airline data for 2007 and 2008 and display departure delay ratio by carrier from Chicago airport.

Access Mode. One file is accessed locally and one file is accessed remotely via an HTTP connection.

Integrated Tool. JavaScript-based visualization.

Will They Blend? Experiments in Data & Tool Blending. Today: MS Word meets Web Crawling. Identifying the Secret Ingredient

Authors: Roland Burger and Heather Fyson

The Challenge

It’s Christmas again, and like every Christmas, I am ready to bake my famous Christmas cookies. They’re great and have been praised by many! My recipes are well-kept secrets as I am painfully aware of the competition. There are recipes galore on the web nowadays. I need to find out more about these internet-hosted recipes and their ingredients, particularly any ingredient I might have overlooked over the years.

My recipes are stored securely in an MS Word document on my computer, while a very well-regarded website for cookie recipes is http://www.handletheheat.com/peanut-butter-snickerdoodles/. I need to compare my own recipe with this one from the web and find out what makes them different. What is the secret ingredient for the ultimate Christmas cookie?

In practice, on one side I want to read and parse my Word recipe texts, and on the other side I want to use web crawling to read and parse this web-hosted recipe. The question, as usual, is: will they blend?
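In Python terms, the two sides of this blend might look roughly like the sketch below, using python-docx for the Word side and requests/BeautifulSoup for the crawling side; the file name and the candidate ingredient list are made up:

    import requests
    from bs4 import BeautifulSoup
    from docx import Document

    # My own recipes, parsed from the local Word file (placeholder path).
    doc = Document("my_recipes.docx")
    my_text = " ".join(p.text.lower() for p in doc.paragraphs)

    # The web-hosted recipe, crawled and reduced to plain text.
    page = requests.get("http://www.handletheheat.com/peanut-butter-snickerdoodles/")
    web_text = BeautifulSoup(page.text, "html.parser").get_text(" ").lower()

    # A toy comparison: which candidate ingredients appear only in the web
    # recipe? (The candidate list is a made-up example.)
    candidates = ["butter", "cinnamon", "nutmeg", "cream of tartar", "vanilla"]
    secret = [c for c in candidates if c in web_text and c not in my_text]
    print("possible secret ingredients:", secret)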

Topic. Christmas cookies.

Challenge. Identifying secret ingredients in cookie recipes from the web by comparing them to my own recipes stored locally in an MS Word document.

Access Mode. MS Word .docx file reader and parser and web crawler nodes.

Will They Blend? Experiments in Data & Tool Blending. Today: A Cross-Platform Ensemble Model: R meets Python and KNIME. Embrace Freedom in the Data Science Lab

The Challenge

Today’s challenge consists of building a cross-platform ensemble model. The ensemble model must combine a Support Vector Machine (SVM), a logistic regression, and a decision tree. Let’s raise the bar even more and train these models on different analytics platforms: R, Python, and of course KNIME. (Note that we could, of course, create all of these models in KNIME, but that would kill the rest of the story...)

A small group of three data scientists was given the task of predicting flight departure delays from Chicago O’Hare (ORD) airport, based on the airline data set. As soon as the data came in, each data scientist built a model in record time - but each one built a different model on a different platform! We ended up with a Python script building a logistic regression, an R script building an SVM, and a KNIME workflow training a decision tree. Which one should we choose?

We had two options here: select the best model and crown the champion, or embrace diversity and build an ensemble model. Since more is usually better than less, we opted for the ensemble model. Thus, we just needed to convince two of the three data scientists to switch analytics platforms.

Or maybe not.

Thanks to its open architecture, KNIME can easily integrate R and Python scripts. In this way, every data scientist can use his/her preferred analytics platform, while KNIME collects and fuses the results.

Today’s challenge has three main characters: a decision tree built on KNIME Analytics Platform, an SVM built in R, and a logistic regression built with Python. Will they blend?
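One simple way to picture the fusion step is a majority vote over the three prediction columns. Here is a toy pandas sketch with made-up predictions (the actual workflow may fuse the models differently):

    import pandas as pd

    # Made-up per-flight predictions (1 = delayed) from the three models.
    preds = pd.DataFrame({
        "knime_tree": [1, 0, 1, 0, 1],
        "r_svm":      [1, 1, 0, 0, 1],
        "py_logreg":  [0, 0, 1, 0, 1],
    })

    # Majority vote: predict a delay whenever at least two of the three agree.
    preds["ensemble"] = (preds.sum(axis=1) >= 2).astype(int)
    print(preds)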

Topic. Flight departure delays from Chicago O’Hare (ORD) Airport.

Challenge. Build a cross-platform ensemble model, by blending an R SVM, a Python logistic regression, and a KNIME decision tree.

KNIME Extensions. Python and R Integrations.

The Node Guide. Finding Help in KNIME Analytics Platform and on the Web

Did you ever ask yourself, while using KNIME Analytics Platform, “What should I do next?” Or “How can I use this node?” Or “What on earth is this parameter for?!” No matter if you are new to KNIME or already an expert, I’m sure you have asked these questions sometimes and might still be wondering about them.

There are many ways already to find the answers.

  • Read the node documentation provided in the KNIME workbench
  • Go to the KNIME Forum to ask for advice on how a node can be used
  • Check out the EXAMPLES Server for demos of particular use cases, with detailed annotations and organized categories

In addition to these three options, we have recently added new help content to our website and to KNIME Analytics Platform…

…Please welcome the Node Guide! Together with the Workflow Coach, they are your new best friends. The information provided here is intended for beginners and experienced users of KNIME alike.

The Node Guide is a searchable web reference for nodes, with example workflows that demonstrate how they are used. You might have used it already. In fact, Node Guide pages now pop up in search engines as answers to KNIME node questions; questions in the KNIME Forum are increasingly answered with links to Node Guide pages; and a new entry in the Learning section of the KNIME web site takes you directly to the Node Guide page.

Will They Blend? Experiments in Data & Tool Blending. Today: Twitter meets PostgreSQL. More than idle chat?

The Challenge

Today we will trespass into the world of idle chatting. Since Twitter is, as everybody knows, THE place for idle chatting, our blending goal for today is a mini-DWH application that archives tweets day by day. The tweet topic could be anything, but for this experiment we investigate what people say about #KNIME through a word cloud of last month’s tweets.

Now, if you connect to Twitter with a free developer account, you will only receive the most recent tweets about the chosen topic. If you are interested in, for example, all tweets from last month, you need to regularly download and store them in an archive. That is, you need to build a data warehousing (DWH) application to archive past tweets.

As the archival tool for this experiment, we chose a PostgreSQL database. At the very least, the DWH application should download and store yesterday’s tweets. As a bonus, it could also create the word cloud from all tweets posted in the past month. That is, it should combine yesterday’s tweets from Twitter with last month’s tweets from the archive to build the desired word cloud.

Summarizing: on one side we collect yesterday’s tweets directly from Twitter; on the other side we retrieve past tweets from a PostgreSQL database. Will they blend?
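The archival side of such a mini-DWH can be sketched in a few lines of Python with psycopg2; the connection settings and the tweet_archive table are placeholders, and the sample tweets are made up:

    import datetime

    import psycopg2

    # Yesterday's tweets as (id, created_at, text); in the real application
    # these come from the Twitter API. The sample rows are made up.
    yesterdays_tweets = [
        (101, datetime.datetime(2017, 2, 20, 9, 30), "Trying the new #KNIME release"),
        (102, datetime.datetime(2017, 2, 20, 11, 5), "#KNIME blends everything"),
    ]

    # Connection settings are placeholders; the tweet_archive table is
    # assumed to exist with columns (id, created_at, text).
    conn = psycopg2.connect(dbname="tweets", user="knime", host="localhost")
    cur = conn.cursor()
    cur.executemany(
        "INSERT INTO tweet_archive (id, created_at, text) VALUES (%s, %s, %s) "
        "ON CONFLICT (id) DO NOTHING",  # skip tweets archived on a previous run
        yesterdays_tweets,
    )
    conn.commit()

    # Last month's tweets, fresh and archived together, for the word cloud.
    cur.execute("SELECT text FROM tweet_archive "
                "WHERE created_at >= now() - interval '1 month'")
    texts = [row[0] for row in cur.fetchall()]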

Topic. What people say about #KNIME on Twitter.

Challenge. Blend tweets fresh from Twitter with tweets from a PostgreSQL database and draw the corresponding word cloud.

Access Mode. Twitter access and Database access.