Tips and tricks for using KNIME.

Creating RESTful services with KNIME, an update

Creating a service

We’ve had a couple of posts in the past about creating RESTful services with the KNIME Analytics Platform and using the REST API provided by the KNIME Server. But since we keep adding functionality and making things easier, it’s worthwhile to occasionally come back and revisit the topic. This post will demonstrate a couple of changes since Jon’s last update.

We’ll start with this simple workflow, which finds the top N rows of a database by sorting on a particular column and then taking the first N rows from the result:

We want to make this workflow available as a web service where the caller can select the column to use for sorting and the value of N. It’s easy to change the sort column and N value in the KNIME Analytics Platform: we just configure those nodes and pick the values we want.



This option isn’t available for web services, so we need to take another approach.

Optimizing KNIME workflows for performance

KNIME® provides performance extensions such as the KNIME Big Data Connectors for executing Hive queries on Hadoop, or the KNIME Spark Executor for training models on Hadoop using Spark. But sometimes it doesn’t make sense to run your analytics on a Big Data cluster. Recently we released the KNIME Cloud Analytics Platform for Azure, which allows you to execute your KNIME workflows on demand on an Azure VM with up to 448 GB RAM and 32 cores, which is one easy way to boost the performance of some of your workflows.

However, there are still a few extra tips and tricks that you can use to speed up execution of your workflows.

KNIME Analytics Platform follows a graphical programming paradigm, which, in addition to clearly expressing the steps taken to process your data, also allows rapid prototyping of ideas. The benefit of rapid prototyping is that you can quickly test an idea and prove the business value of that idea practically. Once you’ve shown that business value, you may deploy the workflow to run regularly on KNIME Server. Below I’ve highlighted some of the tips and tricks I’ve learned from KNIMErs that might help speed up the execution of some of your workflows. Read on to learn more.

Speaking Kerberos with KNIME Big Data Extensions

The ever-increasing application of big data technology brings with it the need to secure access to the data in your Hadoop cluster. In Hadoop's early days, security offered protection only against accidental misuse, as not even user credentials were properly enforced. Before Hadoop could become a true multi-user platform, where organizations can put all of their data and host many different applications, the security issue obviously had to be tackled. The Hadoop community addressed this problem by adopting the venerable Kerberos protocol. Kerberos – which dates back to 1993 – is an authentication protocol for distributed applications, and support for it is also integrated into KNIME Big Data Extensions.

Using Kerberos is often a bit of a hassle. But don’t despair: in this blog post we will show you, step by step, how to properly configure a Windows client with KNIME Big Data Connectors so that you can connect to Hive, Impala, and HDFS in a Kerberos-secured cluster.

By the way, KNIME Spark Executor also supports Kerberos. Detailed configuration steps can be found in the Installation Guide PDF on the product page.

If any of the Kerberos terminology is unclear, you can refer to the glossary at the bottom of this blog post.

Not all the REST, but a bit more of it in KNIME Server 4.3

The latest version of KNIME Server, 4.3, brings some additions to its REST interface. In this article I will present some of them and show how they can be used by client programs. Before we start I should mention that the Mason specification that we use as the response format has changed slightly, and we have adapted KNIME Server accordingly. You may want to have a look at the current version in case you have been using the Mason metadata.

Up and down

The most important addition is the possibility to upload and download items from the workflow repository. The usage is straightforward once you know the addresses. If you read our previous articles on the REST interface ([1], [2]), you may recall that the workflow repository can be browsed. The entry point for the repository is "/rest/v4/repository/". The metadata of items in the repository can be queried by simply appending the item's path to the base address, e.g. "/rest/v4/repository/workflow", and issuing a GET request. This will return the name, owner, type, and some other properties. Note that this also means the same address cannot be used to access the actual contents of the item. Instead you have to append ":data" to the item's URL and issue a GET request: GET http://<server>/knime/rest/v4/repository/workflow:data. You will also find this address in the @controls block of the Mason structure for the particular item (but only if you have permission to download the item). The result is a ZIP archive that contains the workflow, data file, or even a whole workflow group. The URL takes an optional query parameter "compressionLevel", with which you can control the compression on the server side (a value between 0 and 9; the default is 1). Since any browser issues a GET request when you paste a URL into the address field, you can easily try this out yourself.
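For a scripted client, the download can be sketched in a few lines of Python. This is a minimal sketch, not an official client: the server address, user name, and password below are placeholders, and only the ":data" address pattern and the "compressionLevel" parameter come from the API described above.

```python
# Sketch of downloading a repository item via the ':data' endpoint.
# "http://server" and the credentials are placeholders.
import base64
import urllib.request

BASE = "http://server/knime/rest/v4/repository"

def data_url(item_path, compression_level=1):
    """Build the ':data' download URL for a repository item (level 0-9, default 1)."""
    return "{}/{}:data?compressionLevel={}".format(
        BASE, item_path.strip("/"), compression_level)

def download_item(item_path, user, password, target="item.zip"):
    """GET the item's contents (a ZIP archive) and write them to disk."""
    req = urllib.request.Request(data_url(item_path))
    token = base64.b64encode("{}:{}".format(user, password).encode()).decode()
    req.add_header("Authorization", "Basic " + token)
    with urllib.request.urlopen(req) as resp, open(target, "wb") as out:
        out.write(resp.read())
```

The same URL pasted into a browser's address field triggers the identical GET request, which is a quick way to verify the address before scripting anything.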

Upload works in much the same way, except that the HTTP method used is PUT. The expected request content is also a ZIP archive containing the item to be uploaded. Please note that the request's content type must be "application/zip", otherwise the server will reject it. The ZIP archive is expected to contain exactly one entry at the root level: either the data file or the folder of the workflow or workflow group.
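The "exactly one root entry" rule is easy to check before uploading. Below is a small sketch in plain Python; the PUT request itself is only outlined in a comment, since the server address and credentials would be placeholders.

```python
# Validate an archive against the "exactly one root entry" rule before uploading.
import io
import zipfile

def root_entries(zip_bytes):
    """Return the distinct root-level entries of a ZIP archive given as bytes."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        return sorted({name.split("/")[0] for name in zf.namelist()})

def is_valid_upload(zip_bytes):
    """KNIME Server expects exactly one entry at the archive's root level."""
    return len(root_entries(zip_bytes)) == 1

# The upload itself would then be a PUT with content type "application/zip", e.g.:
#   req = urllib.request.Request(
#       "http://<server>/knime/rest/v4/repository/workflow:data",
#       data=zip_bytes, method="PUT")
#   req.add_header("Content-Type", "application/zip")
#   urllib.request.urlopen(req)
```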

Semantic enrichment of textual documents

Author: Julien Grossmann, Political and Violent Risk Data Analyst, IHS Economic and Country Risk

I’ve been using KNIME Analytics Platform for a year and a half, and in this time KNIME has become a vital part of my work. As a political and violent risk data scientist, I am often confronted with incomplete or badly structured data. But with KNIME I can always find ways to efficiently clean up, organize, and analyze my data, all despite a total lack of programming or coding knowledge.

One of the most frustrating aspects of working with unstructured or semi-structured data is that it is rarely ready-made for your needs. It almost always requires extensive manual clean-up.

Worse, if your needs keep shifting, the data restructuring and clean-up keep shifting too.

For instance, the typical data I work with involve datasets of events (such as civil unrest or terrorist attacks). Most datasets have basic metadata like date, location, and a short description of the incident, but there is only so much you can conclude from such a basic dataset. Often I need to drill down and look for specific groups, actions, or targets.

In the old days, we would use basic search functions in Excel or, for the less geeky among us, do it manually.

So I decided to create a little workflow using the KNIME Textprocessing extension. The idea was to be able to mine large datasets rapidly, using a customizable list of keywords.
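The core idea behind such a workflow can be sketched outside KNIME as well. Here is a hedged illustration in plain Python rather than the Textprocessing nodes: each event description is tagged with whichever entries from a customizable keyword list it contains. The sample descriptions and keywords are made up for the example.

```python
# Tag event descriptions with matching keywords from a customizable list.
import re

def tag_events(descriptions, keywords):
    """Map each description to the sorted list of keywords found in it."""
    # Whole-word, case-insensitive matching for each keyword.
    patterns = {kw: re.compile(r"\b" + re.escape(kw) + r"\b", re.IGNORECASE)
                for kw in keywords}
    return [sorted(kw for kw, pat in patterns.items() if pat.search(text))
            for text in descriptions]
```

Swapping in a different keyword list immediately re-tags the whole dataset, which is exactly what makes the approach attractive when your analytical needs keep shifting.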

From D3 example to interactive KNIME view in 10 minutes

Interactive visualizations on the web have become very popular recently. The JavaScript framework D3 has developed into one of the most widely used libraries among websites and publishers for rich graphics and visualizations.

In this blog post I’d like to show you how easy it is to harness D3’s capabilities and bring them into KNIME. To do so I am taking an existing example from the D3 website and will create an interactive view using the Generic JavaScript view in just 10 minutes.

So let’s get started.

Streaming data in KNIME

The KNIME Streaming Executor is an extension that currently "hides" inside KNIME Labs. Not many of our users are aware it exists, so let's find out what it can (and can't) do and why you should be interested.

Streaming?

If you are used to KNIME's standard execution model, you will know that connected nodes in a workflow are executed one after the other. The obvious advantage is that you can easily understand what is currently executing and see the (intermediate) results produced by each node in your workflow. That significantly simplifies debugging, because you can see immediately if some intermediate step isn't producing the expected result. You can also reset and continue the execution at any point in the workflow without re-running the whole workflow -- saving on computation time. Another benefit of the standard execution model is that you can use KNIME's Hiliting to inspect the results and also explore your data using some of the view nodes.

However, the standard execution model also has some drawbacks. Each node needs to cache its results, which requires additional temporary space. (Note that KNIME doesn't duplicate the data from node to node but saves only what has changed between subsequent nodes ... so this isn't as bad as it might sound.) Additionally, since a node doesn't start processing until its upstream node has completed, it needs to read the dataset from start to end. Depending on whether those data live in main memory or on hard disk (which is KNIME's choice and depends on data size and available memory), this may require additional I/O operations, which can slow things down. You will notice that if you have a long chain of simple preprocessing nodes, most of the time is spent just reading the data and caching the intermediate results.
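The contrast can be sketched in plain Python (an analogy, not KNIME's actual implementation): the "cached" pipeline materializes every intermediate table, while the "streamed" one passes rows through generators so no intermediate result is ever held in full.

```python
# Cached execution: every "node" materializes its full output before the next starts.
def cached_pipeline(rows):
    upper = [r.upper() for r in rows]      # node 1 caches its whole result
    trimmed = [r.strip() for r in upper]   # node 2 caches again
    return [r for r in trimmed if r]       # node 3: filter out empty rows

# Streamed execution: rows flow through all steps one at a time.
def streamed_pipeline(rows):
    upper = (r.upper() for r in rows)
    trimmed = (r.strip() for r in upper)
    return (r for r in trimmed if r)
```

Both pipelines produce the same rows; the streamed variant simply avoids the intermediate caches, which is the trade-off the Streaming Executor exploits.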

The KNIME Server REST API

The KNIME® Server REST API has already been covered from a design perspective in this blog post by Thorsten. My blog post aims to give you a hands-on guide to getting started with the REST API. I’ll show some tools that you can use to get started and point you in the direction of some libraries that might help. But first I’ll discuss why you might want to use the REST API.

REST is an architectural style for building networked applications. Practically, it is increasingly used in modern IT applications to allow developers and architects to build robust, flexible applications by integrating tools from various sources. With KNIME Server, you can use REST to give external applications access to your workflows. That might mean that you build a workflow to predict something, and an external application can then trigger that workflow to predict on the data it is interested in.

A typical use case for a REST service would be to integrate the results from KNIME workflows into an existing, often complex IT infrastructure. More concretely, you might have a KNIME workflow that predicts whether your customers are feeling good, or bad about your company today. Useful information, but only if this can be supplied to your colleagues in a timely manner. In the example of a call centre, it would allow colleagues to follow a script that will help to find the reason for the negative sentiment, and hopefully turn an unhappy customer into a happy customer.
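To make the call-centre scenario concrete, here is a hedged sketch of the client side: an external application packages a customer's text for the sentiment workflow and routes the call based on the prediction. The JSON field names and the routing labels are hypothetical placeholders for this illustration, not the actual KNIME Server API.

```python
# Hypothetical client-side glue for the call-centre scenario.
import json

def build_request(customer_text):
    """Assemble the JSON body a (hypothetical) sentiment service would expect."""
    return json.dumps({"input": {"text": customer_text}})

def route_call(prediction):
    """Pick a call-centre script based on the predicted sentiment label."""
    return "retention-script" if prediction == "negative" else "standard-script"
```

The point is the integration pattern: the external application never needs to know how the workflow works internally, only how to POST its input and read the prediction back.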

Sentiment Analysis with N-grams

Sentiment analysis of free-text documents is a common task in the field of text mining. In sentiment analysis, predefined sentiment labels, such as "positive" or "negative", are assigned to text documents.

In a previous article we described how a predictive model was built to predict the sentiment labels of documents (positive or negative). In this approach single words were used as features. It was shown that the most discriminative terms w.r.t. separating the two classes are "bad", "wast", and "film". If the term "bad" occurs in a document, it is likely to have a negative sentiment. If "bad" does not occur but "wast" (the stem of "waste") does, the document is again likely to score negatively, and so on.

There are, however, several drawbacks to using only single words as features. Negations such as "not bad" or "not good", for example, will not be taken into account. In fact, single-word features can even lead to misclassification. Using frequent n-grams as features in addition to single words can overcome this problem.

In this blog post we show an example of how the usage of 1- and 2-grams as features for sentiment prediction can increase the accuracy of the model in comparison with only single word features. The KNIME Text Processing extension is used again in combination with traditional KNIME learner and predictor nodes to process the textual data and build and score the predictive model.
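The feature extraction itself is simple to illustrate outside KNIME. The plain-Python sketch below (not the Text Processing nodes) extracts 1- and 2-grams from a tokenized document; note how the negation "not bad" survives as a single feature instead of being split into "not" and "bad".

```python
# Extract 1- and 2-gram features from a text.
def ngrams(tokens, n):
    """All contiguous n-token sequences, joined with spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def features(text):
    """Unigram plus bigram features for a whitespace-tokenized, lowercased text."""
    tokens = text.lower().split()
    return ngrams(tokens, 1) + ngrams(tokens, 2)
```

A learner trained on such features can weight "not bad" positively even while "bad" alone stays a strong negative indicator.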

How to Setup the Python Extension

With the new Python integration in KNIME you can use an existing Python installation from KNIME. But what if Python is not yet installed on the system? Here is a quick step-by-step guide on how to install Python and get it working in KNIME.
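Once Python is installed, a quick sanity check saves debugging later: report the interpreter version and whether the packages the integration relies on are importable. The package list below is an assumption for illustration; check the KNIME documentation for the exact requirements of your KNIME version.

```python
# Sanity-check a Python installation before pointing KNIME at it.
import importlib.util
import sys

def python_version():
    """The interpreter version as 'major.minor.micro'."""
    return "{}.{}.{}".format(*sys.version_info[:3])

def is_available(package):
    """True if 'import package' would succeed in this interpreter."""
    return importlib.util.find_spec(package) is not None

if __name__ == "__main__":
    print("Python", python_version())
    for pkg in ("pandas", "numpy"):  # assumed requirements -- verify against the docs
        print(pkg, "OK" if is_available(pkg) else "MISSING")
```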