In this blog series we’ll be experimenting with the most interesting blends of data and tools. Whether it’s mixing traditional sources with modern data lakes, open-source devops on the cloud with protected internal legacy tools, SQL with noSQL, web-wisdom-of-the-crowd with in-house handwritten notes, or IoT sensor data with idle chatting, we’re curious to find out: will they blend? Want to find out what happens when IBM Watson meets Google News, Hadoop Hive meets Excel, R meets Python, or MS Word meets MongoDB?
Follow us here and send your ideas for the next data blending challenge you’d like to see to willtheyblend@knime.com.
Authors: Heather Fyson and Kilian Thiel
The Challenge
“A plague o’ both your houses! They have made worms’ meat of me!” said Mercutio in Shakespeare’s “Romeo and Juliet”, a play in which tragedy results from the characters’ inability to communicate effectively. It is worsened by the fact that Romeo and Juliet each come from one of the feuding “two households”: Romeo is a Montague and Juliet a Capulet.
For this blog article, we decided to take a look at the interaction between the characters in the play by analyzing the script, an epub file, to see just who talks to whom. Are the Montagues and Capulets really divided families? Do they really not communicate? To make the results easier to read, we decided to visualize the network as a graph, with each node in the graph representing a character in the play and showing an image of that character.
The “Romeo and Juliet” e-book can be downloaded for free in a number of formats from the Project Gutenberg website. For this experiment, we downloaded the epub file. epub is an e-book file format supported by many e-reading devices and apps (for more information about the epub format, check https://en.wikipedia.org/wiki/EPUB).
The images of the characters in “Romeo and Juliet” were kindly made available by Stadttheater Konstanz as JPEG files from a live show. JPEG is a commonly used format for storing images (for more information about the JPEG format, check https://en.wikipedia.org/wiki/JPEG).
Unlike the Montague and the Capulet families – will epub and JPEG files blend?
Topic. Analyzing the graph structure of the dialogs in Shakespeare’s tragedy “Romeo and Juliet”.
Challenge. Blending epub and JPEG files and combining text mining and network visualization.
Access Mode. epub parser and JPEG reader.
Reading and Processing the Text in the epub file
Loading the JPEG image files
Blending Text and Images in the Dialog Graph
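For readers who want to peek inside the epub outside of KNIME, the first step above can be approximated in a few lines of Python. This is only a sketch, not what the workflow does internally. It relies on the fact that an epub is a ZIP archive of XHTML documents, it reads the documents in archive order rather than in the reading order declared in the package file, and `epub_to_text` is a hypothetical helper name.

```python
import zipfile
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects the non-empty text nodes of one (X)HTML document."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Keep only text that is not pure whitespace.
        if data.strip():
            self.chunks.append(data.strip())


def epub_to_text(source):
    """Return the text of all (X)HTML members of an epub, one text node per line.

    `source` is a file path or a binary file-like object.
    Simplification: members are read in archive order, and text inside
    tags such as <title> or <style> is not filtered out.
    """
    texts = []
    with zipfile.ZipFile(source) as archive:
        for name in archive.namelist():
            if name.lower().endswith((".xhtml", ".html", ".htm")):
                parser = _TextCollector()
                parser.feed(archive.read(name).decode("utf-8", errors="replace"))
                parser.close()  # flush any buffered trailing text
                texts.append("\n".join(parser.chunks))
    return "\n".join(texts)
```

Calling `epub_to_text("romeo_and_juliet.epub")` (the file name is a placeholder) would return the raw script text, which would still need to be split into speakers and lines before any network can be built.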
The final workflow shown in Figure 1 is available on the KNIME EXAMPLES server under
08_Other_Analytics_Types/01_Text_Processing/18_epub_JPEG_Romeo_Juliet*
The final graph showing the interaction degree between the different characters of the play is depicted in Figure 2.
Figure 1. This workflow successfully blends a Kindle epub file of the play “Romeo and Juliet” with the JPEG images of the play’s characters in a live show in a dialog graph.

Yes, they blend!
The graph that comes out of this experiment shows the clear separation between the two families quite impressively. All Montagues in green are on one side; all Capulets in red on the other. The two families only interact with each other through a small number of characters and, not surprisingly, most of the interaction that does take place between the families is between Romeo and Juliet. The separation is so neat that we are tempted to think that Shakespeare used a graph himself to lay out the tragedy’s dialogs!
In this experiment, we’ve managed to create and visualize the network of interaction between all characters from “Romeo and Juliet”, by parsing the epub text document, reading the JPEG images for all characters, and blending the results into the network.
So, even for this experiment, involving epub and JPEG files, text mining and network visualization, we can conclude that … yes, they blend!
Figure 2. Interaction network of characters from “Romeo and Juliet”. The border color of the nodes indicates the family assignment and the node size reflects the term-frequency of the character in the epub document. Edge thickness between two characters reflects their interaction degree throughout the play.
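The two weights encoded in Figure 2 can be illustrated with a small Python sketch. It is not the KNIME workflow's logic, and it rests on assumptions of ours: the play has already been reduced to an ordered list of (scene, speaker) turns, a speaker's number of turns stands in for the term frequency, and two characters count as interacting whenever they speak back to back in the same scene. The function name and the toy data are hypothetical.

```python
from collections import Counter


def interaction_graph(turns):
    """Build node and edge weights from an ordered list of (scene, speaker) turns.

    Returns (node_weight, edge_weight):
      node_weight[speaker] -> number of turns by that speaker
      edge_weight[(a, b)]  -> number of back-to-back exchanges between a and b
                              in the same scene, with (a, b) sorted alphabetically
    """
    node_weight = Counter(speaker for _, speaker in turns)
    edge_weight = Counter()
    # Walk over consecutive turns and count each change of speaker within a scene.
    for (scene_a, a), (scene_b, b) in zip(turns, turns[1:]):
        if scene_a == scene_b and a != b:
            edge_weight[tuple(sorted((a, b)))] += 1
    return node_weight, edge_weight


# Toy data for illustration only, not extracted from the epub.
turns = [
    ("2.2", "Romeo"), ("2.2", "Juliet"), ("2.2", "Romeo"),
    ("2.3", "Friar Lawrence"), ("2.3", "Romeo"),
]
nodes, edges = interaction_graph(turns)
print(nodes["Romeo"])              # 3
print(edges[("Juliet", "Romeo")])  # 2
```

In a plot, `node_weight` would drive the node size and `edge_weight` the edge thickness; the family color and the character image would be attached to each node as extra attributes.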

If you enjoyed this, please share it generously and let us know your ideas for future blends.
We’re looking forward to the next challenge. What about Teradata and the KNIME .table format? Will they blend?
Thank you, Stadttheater Konstanz, for allowing us to use the photos from a production of Romeo and Juliet.
Photographer: Ilja Mess.
* The link will open the workflow directly in KNIME Analytics Platform (requirements: Windows; KNIME Analytics Platform must be installed with the Installer version 3.2.0 or higher)