The recent explosion in data set size, both in number of records and in number of attributes, has triggered the development of a number of big data platforms as well as parallel data analytics algorithms. At the same time, it has pushed for the use of dimensionality reduction procedures. Indeed, more is not always better: large amounts of data can sometimes produce worse performance in data analytics applications.
One of my most recent projects was about churn prediction and used the large data set of the 2009 KDD Challenge. What sets this data set apart is its very high dimensionality, with 15K data columns. Most data mining algorithms are implemented column-wise, which makes them slower and slower as the number of data columns grows. The first milestone of the project was therefore to reduce the number of columns in the data set while losing as little information as possible.
Using the project as an excuse, we started exploring the state of the art in dimensionality reduction techniques currently available and accepted in the data analytics landscape.
We took this chance to compare those techniques on the smaller data set of the 2009 KDD Challenge in terms of reduction ratio, accuracy degradation, and speed. The final accuracy and its degradation depend, of course, on the model selected for the analysis. Thus, the compromise between reduction ratio and final accuracy is optimized against a bag of three specific models: decision tree, neural networks, and naïve Bayes.
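The compromise between reduction ratio and accuracy can be sketched as a small search over candidate thresholds. This is only a minimal sketch under stated assumptions, not the KNIME workflow itself: `reduce_columns` and `evaluate_accuracy` are hypothetical callables standing in for one reduction technique and one trained model, and `tolerance` is an assumed parameter for how much accuracy loss is acceptable.

```python
from typing import Callable, Iterable, List, Optional, Tuple


def pick_threshold(
    columns: List[str],
    thresholds: Iterable[float],
    reduce_columns: Callable[[List[str], float], List[str]],
    evaluate_accuracy: Callable[[List[str]], float],
    tolerance: float = 0.0,
) -> Optional[Tuple[float, List[str], float]]:
    """Return (threshold, kept columns, accuracy) for the smallest column set
    whose accuracy is within `tolerance` of the all-columns baseline.

    Both callables are hypothetical placeholders for a reduction technique
    and a model evaluation step.
    """
    baseline = evaluate_accuracy(columns)
    best = None
    for t in thresholds:
        kept = reduce_columns(columns, t)
        if not kept:
            continue  # a threshold that removes everything is useless
        acc = evaluate_accuracy(kept)
        if acc < baseline - tolerance:
            continue  # too much accuracy lost with respect to the baseline
        # Prefer fewer columns; break ties by higher accuracy.
        if best is None or (len(kept), -acc) < (len(best[1]), -best[2]):
            best = (t, kept, acc)
    return best
```

In practice the same search would be repeated for each technique and each model in the bag, keeping the best performing combination.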
By running the optimization loop, we determined the best cutoffs, in terms of lowest number of columns and best accuracy, for each of the seven dimensionality reduction techniques and for the best performing model. The final best model performance, measured as accuracy and Area under the ROC Curve, was then compared with the performance of the baseline algorithm using all input features. The results of this comparison are reported in the table below.
| Dimensionality Reduction | Reduction Rate | Accuracy on validation set | Best Threshold | AuC | Notes |
|---|---|---|---|---|---|
| Baseline | 0% | 73% | - | 81% | Baseline models are using all input features |
| Missing Values Ratio | 71% | 76% | 0.4 | 82% | - |
| Low Variance Filter | 73% | 82% | 0.03 | 82% | Only for numerical columns |
| High Correlation Filter | 74% | 79% | 0.2 | 82% | No correlation available between numerical and nominal columns |
| PCA | 62% | 74% | - | 72% | Only for numerical columns |
| Random Forests / Ensemble Trees | 86% | 76% | - | 82% | - |
| Backward Feature Elimination + missing values ratio | 99% | 94% | - | 78% | Backward Feature Elimination and Forward Feature Construction are prohibitively slow on high dimensional data sets. They become practical only when they follow other dimensionality reduction techniques, like here the one based on the number of missing values. |
| Forward Feature Construction + missing values ratio | 91% | 83% | - | 63% | Same as above |
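To see why Backward Feature Elimination is so slow on wide data sets, here is a minimal sketch of the greedy loop. It assumes a hypothetical `score` callable that trains a model on the given feature list and returns its accuracy; the stopping rule based on `max_loss` is also an assumption made for illustration.

```python
from typing import Callable, List, Sequence, Tuple


def backward_elimination(
    features: Sequence[str],
    score: Callable[[List[str]], float],
    max_loss: float = 0.0,
) -> Tuple[List[str], float]:
    """Greedily drop one feature per round while the score stays within
    `max_loss` of the score obtained with all features.

    `score` is a hypothetical stand-in for "train a model on these features
    and return its accuracy".
    """
    kept = list(features)
    baseline = score(kept)
    current = baseline
    while len(kept) > 1:
        # One training run per remaining feature, each time leaving one out.
        trials = [(score([f for f in kept if f != drop]), drop) for drop in kept]
        best_score, best_drop = max(trials, key=lambda trial: trial[0])
        if best_score < baseline - max_loss:
            break  # every possible removal now costs too much accuracy
        kept.remove(best_drop)
        current = best_score
    return kept, current
```

Each round retrains the model once per remaining feature, so the number of training runs grows roughly quadratically with the number of columns. That is why a cheap filter applied first makes the procedure practical.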
Notice that the highest reduction ratio without performance degradation is obtained by analyzing the decision cuts in many random forests (Random Forests/Ensemble Trees). However, even just counting the number of missing values, measuring the column variance, and measuring the correlation of pairs of columns can lead to a satisfactory reduction rate while keeping performance unaltered with respect to the baseline models.
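The three simple filters mentioned above can be sketched in a few lines of plain Python. This is an illustrative sketch, not the KNIME implementation: the table is assumed to be a dictionary of numerical columns with `None` marking missing values, the min-max scaling before the variance check is an assumption made so that one threshold is comparable across columns, and all function names are hypothetical.

```python
import math
from statistics import pvariance
from typing import Dict, List, Optional

Column = List[Optional[float]]  # None marks a missing value


def missing_ratio_filter(data: Dict[str, Column], max_missing: float) -> List[str]:
    """Keep columns whose fraction of missing values is at most max_missing.
    Columns are assumed to be non-empty."""
    return [
        name for name, col in data.items()
        if sum(v is None for v in col) / len(col) <= max_missing
    ]


def scaled_variance(col: Column) -> float:
    """Population variance of the present values after min-max scaling to [0, 1]."""
    present = [v for v in col if v is not None]
    if not present:
        return 0.0
    lo, hi = min(present), max(present)
    if hi == lo:
        return 0.0  # constant column
    return pvariance([(v - lo) / (hi - lo) for v in present])


def low_variance_filter(data: Dict[str, Column], min_variance: float) -> List[str]:
    """Keep columns whose scaled variance exceeds min_variance."""
    return [name for name, col in data.items() if scaled_variance(col) > min_variance]


def pearson(a: Column, b: Column) -> float:
    """Pearson correlation over the rows where both values are present."""
    pairs = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    n = len(pairs)
    if n < 2:
        return 0.0
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    if var_x == 0 or var_y == 0:
        return 0.0  # correlation is undefined for a constant column
    return cov / math.sqrt(var_x * var_y)


def high_correlation_filter(data: Dict[str, Column], max_abs_corr: float) -> List[str]:
    """Scan columns in order and keep a column only if it is not too strongly
    correlated with any column already kept."""
    kept: List[str] = []
    for name, col in data.items():
        if all(abs(pearson(col, data[k])) <= max_abs_corr for k in kept):
            kept.append(name)
    return kept
```

Each function returns the names of the columns to keep, so the filters can be chained by passing on only the surviving columns. Note that the correlation filter depends on column order: of two highly correlated columns, the one seen first is kept.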
What we have learned from this little review exercise is that dimensionality reduction is useful not only to speed up algorithm execution, but also to improve model performance. The Area under the Curve (AuC) in the table shows a slight increase on the test data when the missing values ratio, the low variance filter, the high correlation filter, or the random forests criterion is applied.
Indeed, in the era of big data, when more is axiomatically taken to be better, we have rediscovered that too many noisy or even faulty input data columns often lead to less than desirable algorithm performance. Removing uninformative or, even worse, misleading input attributes might help build a model on more extensive data regions, with more general classification rules, and overall with better performance on new, unseen data.
Recently, we asked data analysts on a LinkedIn group (https://www.linkedin.com/grp/post/35222-5998794653007171586) for the most used dimensionality reduction techniques, besides the seven described in this blog post. The answers involved Random Projections, NMF, (Stacked) Auto-encoders, Chi-square or Information Gain, Multidimensional Scaling, Correspondence Analysis, Factor Analysis, Clustering, and Bayesian Models. Thanks to Asterios Stergioudis, Raoul Savos, and Michael Will who provided the suggestions on the LinkedIn group.
The workflows described in this blog post are available on the KNIME EXAMPLES server under 003_Preprocessing/003005_dimensionality_reduction.
Both small and large data sets from the 2009 KDD Challenge can be downloaded from http://kdd.org/kdd-cup/view/kdd-cup-2009/Data.
This is just a brief summary of the whole project. If you are interested in all the tiny details, you can always read the related whitepaper, in the Whitepapers section on the KNIME web site: https://www.knime.org/files/knime_seventechniquesdatadimreduction.pdf
Below are the ROC curves for all the evaluated dimensionality reduction techniques and the best performing machine learning algorithm. The value of the area under the curve is shown in the legend.
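For readers who want to reproduce this kind of curve outside KNIME, here is a minimal sketch of how ROC points and the area under the curve can be computed from class labels and model scores. The function names are hypothetical, labels are assumed to be 1 for the positive class and 0 for the negative class, and the area is approximated with the trapezoidal rule.

```python
from typing import List, Sequence, Tuple


def roc_points(labels: Sequence[int], scores: Sequence[float]) -> List[Tuple[float, float]]:
    """Return (false positive rate, true positive rate) points obtained by
    sweeping the decision threshold from the highest score to the lowest."""
    pos = sum(1 for label in labels if label == 1)
    neg = len(labels) - pos
    if pos == 0 or neg == 0:
        raise ValueError("need at least one positive and one negative example")
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        j = i
        # Examples with the same score cross the threshold together.
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            if ranked[j][1] == 1:
                tp += 1
            else:
                fp += 1
            j += 1
        points.append((fp / neg, tp / pos))
        i = j
    return points


def area_under_curve(points: Sequence[Tuple[float, float]]) -> float:
    """Trapezoidal area under a curve given as points sorted by x."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2.0
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )
```

A model that ranks every positive example above every negative one reaches an area of 1.0, while a model whose scores carry no information stays around 0.5.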