In this blog series we’ll be experimenting with the most interesting blends of data and tools. Whether it’s mixing traditional sources with modern data lakes, open-source devops on the cloud with protected internal legacy tools, SQL with noSQL, web-wisdom-of-the-crowd with in-house handwritten notes, or IoT sensor data with idle chatting, we’re curious to find out: will they blend? Want to find out what happens when IBM Watson meets Google News, Hadoop Hive meets Excel, R meets Python, or MS Word meets MongoDB?
Follow us here and send us your ideas for the next data blending challenge you’d like to see at email@example.com.
Authors: Roland Burger and Heather Fyson
It’s Christmas again, and like every Christmas, I am ready to bake my famous Christmas cookies. They’re great and have been praised by many! My recipes are well-kept secrets as I am painfully aware of the competition. There are recipes galore on the web nowadays. I need to find out more about these internet-hosted recipes and their ingredients, particularly any ingredient I might have overlooked over the years.
My recipes are stored securely in an MS Word document on my computer, while a very well regarded web site for cookie recipes is http://www.handletheheat.com/peanut-butter-snickerdoodles/. I need to compare my own recipe with this one on the web and find out what makes them different. What is the secret ingredient for the ultimate Christmas cookie?
In practice, on one side I want to read and parse my Word recipe texts and on the other side I want to use web crawling to read and parse this web-hosted recipe. The question as usual is: will they blend?
The Text Processing Extension of KNIME Analytics Platform version 3.3 includes the Tika library with a Tika Parser node that can read and parse virtually any kind of document: docx, pdf, 7z, bmp, epub, gif, groovy, java, ihtml, mp3, mp4, odc, odp, pptx, pub, rtf, tar, wav, xls, zip, and many more!
The KNIME Text Processing extension can be installed from KNIME Labs Extension / KNIME TextProcessing. (A video on YouTube explains how to install KNIME extensions, if you need it.)
Reading the MS Word file
Extracting the list of ingredients
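For readers who want to see what reading a Word file involves outside of KNIME, here is a minimal Python sketch. It is not the Tika Parser node and not the workflow's method. It only relies on the fact that a .docx file is a zip archive whose main text lives in word/document.xml. The function names and the demo recipe lines are hypothetical and exist only for illustration.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used inside word/document.xml
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_paragraphs(source):
    """Return the non-empty paragraphs of a .docx file as a list of strings.

    `source` may be a file path or a binary file-like object.
    """
    with zipfile.ZipFile(source) as archive:
        xml_bytes = archive.read("word/document.xml")
    root = ET.fromstring(xml_bytes)
    paragraphs = []
    for p in root.iter(W_NS + "p"):
        # A paragraph's text is spread over one or more <w:t> runs.
        text = "".join(t.text or "" for t in p.iter(W_NS + "t")).strip()
        if text:
            paragraphs.append(text)
    return paragraphs


def make_demo_docx(lines):
    """Build a tiny in-memory .docx (hypothetical content) for demonstration.

    It contains only word/document.xml, which is enough for docx_paragraphs
    but not a complete document that MS Word would open.
    """
    body = "".join(
        "<w:p><w:r><w:t>{}</w:t></w:r></w:p>".format(line) for line in lines
    )
    document = (
        '<?xml version="1.0" encoding="UTF-8"?>'
        '<w:document xmlns:w="http://schemas.openxmlformats.org/'
        'wordprocessingml/2006/main"><w:body>' + body + "</w:body></w:document>"
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("word/document.xml", document)
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    demo = make_demo_docx(["2 cups flour", "1 cup sugar"])
    for paragraph in docx_paragraphs(demo):
        print(paragraph)
```

With a real recipe file you would pass its path to docx_paragraphs instead of the demo buffer. Tika handles many more formats and edge cases than this sketch does.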
Here we use a KNIME Community Extension, the Palladian extension, to crawl the web. It’s available under KNIME Community Contributions – Other / Palladian for KNIME. (A video on YouTube explains how to install KNIME extensions.) After installing it, we proceed to crawl the page http://www.handletheheat.com/peanut-butter-snickerdoodles/.
Crawling the web page.
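The crawling step can also be approximated in plain Python. The sketch below is a stand-in for illustration, not the Palladian nodes used in the workflow: it downloads a page with the standard library and keeps only the visible text. The class and function names are hypothetical, and some sites may refuse requests from scripts, so treat fetch_page_text as an assumption that needs network access and a cooperative server.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> content."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def html_to_text(html):
    """Return the visible text of an HTML string, one chunk per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


def fetch_page_text(url):
    """Download a page and return its visible text (requires network access)."""
    with urlopen(url) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        html = response.read().decode(charset, errors="replace")
    return html_to_text(html)


if __name__ == "__main__":
    sample = "<html><body><h1>Cookies</h1><p>1 cup peanut butter</p></body></html>"
    print(html_to_text(sample))
```

Calling fetch_page_text with the recipe URL would return the whole page text, including navigation and comments, so the recipe section still has to be isolated afterwards.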
Now that we have the recipe text, we repeat the same steps described above in “Extracting the list of ingredients”.
Two lists of ingredients have emerged from the two workflow branches. We need to compare them and discover the secret ingredient – if any – in the new recipe as downloaded from the web.
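Comparing two ingredient lists boils down to a set difference. Here is a minimal Python sketch of that idea, assuming both lists are already available as plain strings. The sample lists and function names are made up for illustration and are not the actual contents of either recipe.

```python
def normalize(ingredient):
    """Lower-case an ingredient name and collapse repeated whitespace."""
    return " ".join(ingredient.lower().split())


def new_ingredients(own, web):
    """Return normalized ingredients found in `web` but not in `own`.

    The result keeps the order of first appearance in `web` and has
    no duplicates.
    """
    known = {normalize(item) for item in own}
    seen = set()
    result = []
    for item in web:
        key = normalize(item)
        if key not in known and key not in seen:
            seen.add(key)
            result.append(key)
    return result


if __name__ == "__main__":
    # Hypothetical example lists, not the real recipes.
    own_recipe = ["Flour", "sugar", "Butter", "eggs"]
    web_recipe = ["flour", "Sugar", "peanut butter", "eggs", "Peanut  Butter"]
    print(new_ingredients(own_recipe, web_recipe))
```

Normalizing before comparing keeps differences in capitalization or spacing from showing up as false "secret" ingredients.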
The workflow used for this experiment is available for download from the EXAMPLES server, as usual, under 08_Other_Analytics_Types/01_Text_Processing/10_Discover_Secret_Ingredient*
Figure 1. This workflow successfully blends ingredients from recipes in a Word document with ingredients from a recipe downloaded from the Web.
Yes, they blend!
The main result is that data from an MS Word document and data from web crawling do blend! We have also shown that ingredients from recipes in a Word document and ingredients from a web-based recipe blend as well!
But what about the secret ingredient for the ultimate cookie? Well, thanks to this experiment I have discovered something I did not know: peanut butter cookies. If you check the final report in Figure 2, you can see the word “peanut” emerging in red from the gray of the traditional ingredients. I will try out this new recipe and see how it compares with my own cookie recipes. If the result is a success, I might add it to my list of closely guarded secret recipes!
Happy Holidays everyone!
Figure 2. The report resulting from the workflow above, identifying peanut (in red) as the secret ingredient of the cookie competition.
Credit for the web-hosted recipe goes to Tessa. More information at http://www.handletheheat.com/peanut-butter-snickerdoodles/
If you enjoyed this, please share it generously and let us know your ideas for future blends.
We will now pause the “Will they blend?” series for the Christmas vacation. We will be back in early January 2017, when we will mix JSON with XML-formatted data. Will they blend?
* The link will open the workflow directly in KNIME Analytics Platform (requirements: Windows; KNIME Analytics Platform must be installed with the Installer version 3.2.0 or higher)