In this blog series we’ll be experimenting with the most interesting blends of data and tools. Whether it’s mixing traditional sources with modern data lakes, open-source devops on the cloud with protected internal legacy tools, SQL with noSQL, web-wisdom-of-the-crowd with in-house handwritten notes, or IoT sensor data with idle chatting, we’re curious to find out: will they blend? Want to find out what happens when IBM Watson meets Google News, Hadoop Hive meets Excel, R meets Python, or MS Word meets MongoDB?
Follow us here and send us your ideas for the next data blending challenge you’d like to see at firstname.lastname@example.org.
Today we will push the limits by attempting to blend data from not just 2 or 3, but 6 databases!
These 6 SQL and noSQL databases are among the top 10 most used databases, as listed in most database comparative web sites (see DB-Engines Ranking, The 10 most popular DB Engines …, Top 5 best databases). Whatever database you are using in your current data science project, there is a very high probability that it will be in our list today. So, keep reading!
What kind of use case is going to need so many databases? Well, actually it’s not an uncommon situation. For this experiment, we borrowed the use case and data sets used in the Basic and Advanced Course on KNIME Analytics Platform. In this use case, a company wants to use past customer data (behavioral, contractual, etc…) to find out which customers are more likely to buy a second product. This use case includes 6 datasets, all related to the same pool of customers.
The goal of this experiment is to retrieve the data from all of these data sources, blend them together, and train a model to predict the likelihood of a customer buying a second product.
The blending challenge of this experiment is indeed an extensive one. We want to collect data from all of the following databases: MySQL, MongoDB, Oracle, MariaDB, MS SQL Server, and PostgreSQL. Six databases in total: five relational databases and one noSQL database.
Will they all blend?
Let’s start by connecting to all of these databases and retrieving the data we are interested in.
Data retrieval from all relational SQL-powered databases follows a single pattern:
It is comforting that this approach - connect to database, select table, and extract data – works for all relational databases. It is equally comforting that the Database Connector node can reach out to any database. This means indeed that with this schema and with the right JDBC driver file I can connect to all existing databases, including vintage versions or those of rare vendors.
Connecting to a NoSQL database, such as MongoDB, follows a different node sequence pattern.
In KNIME Labs, a MongoDB sub-category hosts a few nodes that allow you to perform basic database operations on a MongoDB database. In particular, the MongoDB Reader node connects to a MongoDB database and extracts data according to the query defined in its configuration window.
Credentials here are required within the configuration window and it is not possible to provide them via the Credentials Input node or the Workflow Credentials option.
Data retrieved from a MongoDB database are encapsulated in a JSON structure. No problem. This is nothing that the JSON to Table node cannot handle. At the output of the JSON to Table node, the data retrieved from the MongoDB database are then made available for the next KNIME nodes.
Figure 1. This is the part of the workflow that blends data from six different databases: MySQL, MongoDB, SQL Server, Oracle, MariaDB, and PostgreSQL.
Most of our six datasets contain information about all of the customers. Only the web activity datasets alone do not cover all customers. However, together they do. The old web activity dataset is concatenated with the new web activity dataset. After that, all data coming from all of the different data sources are adjusted, renamed, converted, and joined so that one row represents one customer, where the customer is identified by its unique customer key.
The resulting dataset is then partitioned and used to train a machine learning model. As machine learning model, we chose a random forest with 100 trees and we trained it to predict the value in “Target” column. “Target” is a binary feature representing whether a customer bought a second product. So training the model to predict the value in “Target” means that we are training the model to produce the likelihood of a customer to buy a second product, given all that we already know about her/him.
The model is then applied to the test set and its performance evaluated with a Scorer node. The model accuracy was calculated to be around 77%.
A very frequent task in data analytics projects is to determine the influence of the input features on the trained model. There are many solutions to that, which also depend on the kind of predictive model that has been adopted.
Backward Feature Elimination starts with all N input features and progressively removes one to see how this affects the model performance. The input feature whose removal lowers the model’s performance the least is left out. This step is repeated until the model’s performance is worsened considerably. The subset of input features producing a high accuracy (or a low error) represents the subset of most influential input features. Of course, the definition of high accuracy (or low error) is arbitrary. It could mean the highest accuracy or a high enough accuracy for our purposes.
The metanode, named “Backward Feature Elimination” and available in the Node Repository under KNIME Labs/Wide Data/Feature Selection, implements exactly this procedure. The final node in the loop, named “Feature Selection Filter”, produces a summary of the model performance, for all steps where the input feature with lowest influence had been removed.
Remember that the Backward Feature Elimination procedure becomes slower with the higher number of input features. It works well with a limited number of input features, but avoid using it to investigate hundreds of them.
In addition, a random forest offers a higher degree of interpretability with respect to other machine learning models. One of the output ports of the Random Forest Learner node provides the number of times an input feature has been the candidate for a split and the number of times it has actually been chosen for the split, for levels 0, 1, and 2 across all trees in the forest. For each input feature, we subsequently defined a heuristic measure of influence, borrowed from the KNIME whitepaper “Seven Techniques for Data Dimensionality Reduction”, as:
influence index = Sum(# splits) / sum(# candidates)
The input features with highest influence indices are the most influential ones on the model performance.
Figure 2. Content of the metanode “Backward Feature Elimination” adapted for a random forest predictive model.
The final workflow is shown in figure 3 and it is downloadable from the KNIME EXAMPLES server under: 01_Data_Access/02_Databases/08_Database_Jam_Session01_Data_Access/02_Databases/08_Database_Jam_Session*.
In figure 3 you can see the five parts of our workflow: Credentials Definition, Database Connections and Data Retrieval, Data Blending to reach one single data table, Predictive Model Training, and Influence Measure of Input Features.
Figure 3. This workflow blends data from 6 different databases: MySQL, MongoDB, SQL Server, Oracle, MariaDB, and PostgreSQL. The blended dataset is used to train a model to predict customer’s likelihood to buy a second product. The last nodes measure input features’ influence on the final predictions.
Yes, data from all of these databases do blend!
In this experiment, we blended data from six, SQL based and noSQL, databases - Oracle, MongoDB, MySQL, MariaDB, SQL Server, and PostgreSQL – to reach one single data table summarizing all available information about our customers.
In this same experiment, we also trained a random forest model to predict the likelihood of a customer buying a second product.
Finally, we measured each input feature’s influence on the final predictions, using a Backward Feature Elimination procedure and a heuristic influence measure based on the numbers of splits and candidates in the random forest. Results from both procedures are shown in figures 4 and 5. Both figures show the prominent role of Age and Estimated Yearly Income and the negligible role of Gender, when predicting whether customer will buy a second product.
Figure 4. Bar Rendering of the influence indices calculated for all input features.
Figure 5. Accuracy for subsets of input features from the configuration window of the Feature Selection Filter node.
This whole predictive and influence analysis was made possible purely because of the data blending operation involving the many different database sources. The main result is therefore another yes! Data can be retrieved from different databases and they all blend!
The data and use case for this post are from the basic and advanced course on KNIME Analytics Platform. The course, naturally, covers much more and goes into far more detail than what we have had the chance to show here.
If you enjoyed this, please share it generously and let us know your ideas for future blends.
We’re looking forward to the next challenge. There we will tackle Teradata databases and KNIME table files. Will they blend?
* The link will open the workflow directly in KNIME Analytics Platform (requirements: Windows; KNIME Analytics Platform must be installed with the Installer version 3.2.0 or higher)