Dimensionality Reduction and Supervised Learning

For our final project in my Machine Learning and Pattern Recognition class, my partner and I investigated the effects of performing principal component analysis on a data set before feeding that data into a neural network. Essentially, we wanted a cost-benefit analysis of how preprocessing with unsupervised learning impacts the performance of a supervised learning model.

Principal component analysis, or PCA, is a form of dimensionality reduction that projects the data onto a smaller set of directions, called principal components, chosen to capture as much of the data's variance as possible. In essence, it discards whatever does little to differentiate the samples, such as white space in pictures, or any variables that are nearly constant across the data set. To run PCA, you specify a target fraction of the total variance, called the explained variance; components are then kept, in decreasing order of the variance each one captures, until their cumulative variance meets that target.
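To make this concrete, here is a minimal sketch of the variance threshold in action, using scikit-learn's PCA on toy data (the library and the data are illustrative choices, not our project's code):

```python
import numpy as np
from sklearn.decomposition import PCA

# Toy data with structure: 1000 samples in 50 dimensions, but most of
# the variance lives in a 5-dimensional subspace (plus a little noise).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5)) @ rng.normal(size=(5, 50))
X += 0.01 * rng.normal(size=(1000, 50))

# Passing a float in (0, 1) as n_components tells scikit-learn to keep
# just enough components to reach that fraction of the total variance.
pca = PCA(n_components=0.99)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                      # (1000, k) with k around 5
print(pca.explained_variance_ratio_.sum())  # cumulative variance >= 0.99
```

Because nearly all of the toy data's variance lives in a five-dimensional subspace, the 99% threshold is met with only a handful of components.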

For this project we worked with the MNIST data set, which contains 60,000 training images of handwritten digits, each stored as a 28×28 grayscale image. Before performing PCA on this data, we flattened each picture into a vector of 784 integers indicating the brightness of its pixels.
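For the curious, here is one way that flattening step looks in code, assuming the Keras MNIST loader (any MNIST source works the same way):

```python
from tensorflow.keras.datasets import mnist

# Load the digits: 60,000 training and 10,000 test images, 28x28 each.
(x_train, y_train), (x_test, y_test) = mnist.load_data()
print(x_train.shape)               # (60000, 28, 28), uint8 values in 0-255

# Flatten each image into one 784-element row vector.
X_train = x_train.reshape(len(x_train), 28 * 28)
X_test = x_test.reshape(len(x_test), 28 * 28)
print(X_train.shape)               # (60000, 784)
```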

Performing the inverse PCA transformation on the decomposed data yields an approximate reconstruction of the image. The GIF to the right shows how many features, or principal components, it takes to maintain each level of explained variance. The effects of PCA are easy to see here: as fewer components are kept, the reconstructions grow blurrier, since the fine details that contribute little variance are the first to go.
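A short sketch of that round trip, again with scikit-learn (an assumed toolchain, not necessarily what we used):

```python
import numpy as np
from sklearn.decomposition import PCA
from tensorflow.keras.datasets import mnist

(x_train, _), _ = mnist.load_data()
X_train = x_train.reshape(len(x_train), 784).astype(np.float64)

pca = PCA(n_components=0.99)              # keep 99% of the variance
X_reduced = pca.fit_transform(X_train)    # decompose: 784 -> k components

# Map the compressed representation back into pixel space; the result
# approximates the original digit, blurred where variance was discarded.
X_restored = pca.inverse_transform(X_reduced)
digit = X_restored[0].reshape(28, 28)     # back to 28x28 for display
```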

Reducing dimensionality in this way proved to have a minimal effect on the accuracy of the network: we could shrink the data to less than half of its original size (while maintaining 99% of the explained variance) at the cost of less than two percentage points of accuracy. The network also tended to train faster on the reduced data than on the full-dimensional set. On balance, PCA preprocessing was a clear win for storage and training speed, with only a small penalty in classification accuracy.
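A rough sketch of how such a comparison can be set up, using scikit-learn's MLPClassifier as a stand-in network; the layer sizes and iteration count here are placeholders, not the settings from our project:

```python
import time
from sklearn.decomposition import PCA
from sklearn.neural_network import MLPClassifier
from tensorflow.keras.datasets import mnist

(x_tr, y_tr), (x_te, y_te) = mnist.load_data()
X_tr = x_tr.reshape(len(x_tr), 784) / 255.0   # flatten, scale to [0, 1]
X_te = x_te.reshape(len(x_te), 784) / 255.0

pca = PCA(n_components=0.99).fit(X_tr)        # 99% explained variance

def train_and_score(A_tr, A_te):
    # The same small MLP, identical settings for both runs.
    clf = MLPClassifier(hidden_layer_sizes=(128,), max_iter=20,
                        random_state=0)
    start = time.perf_counter()
    clf.fit(A_tr, y_tr)
    return clf.score(A_te, y_te), time.perf_counter() - start

acc_full, t_full = train_and_score(X_tr, X_te)
acc_pca, t_pca = train_and_score(pca.transform(X_tr), pca.transform(X_te))
print(f"full: acc={acc_full:.3f} in {t_full:.0f}s | "
      f"pca: acc={acc_pca:.3f} in {t_pca:.0f}s")
```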

Our paper contains more details on the results of our project.