Dimensionality Reduction

Jonathan Schein
Oct 30, 2020 · 3 min read


Oftentimes, you will encounter machine learning problems with a very large number of features, sometimes even millions. This is an issue for two reasons: first, the more features there are, the slower training will be; second, it becomes much harder to find a good solution. This problem is commonly referred to as the curse of dimensionality.

Curse of Dimensionality

Thankfully, there are ways to reduce the number of features in a machine learning problem. Of course, this comes at a price: even though training becomes faster, your system will likely perform slightly worse than it otherwise would have, because some information is lost. In some cases dimensionality reduction also filters out noise and unnecessary features, but usually its main benefit is simply speeding up training. Keep in mind that the more dimensions the training set has, the greater the risk of overfitting it, which is one reason why, in high-dimensional problems, it also helps to increase the size of the training set.

Data visualization

Reducing the number of features is also great for data visualization. If you can reduce the dimensionality down to two or three, you can plot the training set on a graph and often draw conclusions by visually detecting patterns and/or clusters.
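As a quick illustration, here is a minimal sketch of reducing a training set to 2 dimensions and plotting it, assuming scikit-learn and matplotlib are available and using the Iris dataset purely as an example:

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

# Iris has 4 features per sample; project them down to 2 dimensions.
X, y = load_iris(return_X_y=True)
X_2d = PCA(n_components=2).fit_transform(X)

# Plot the projected training set; clusters often become visible by eye.
plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y)
plt.xlabel("1st principal component")
plt.ylabel("2nd principal component")
plt.show()
```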

Algorithms

There are two main approaches to reducing dimensionality: projection and manifold learning. The most popular dimensionality reduction algorithm, Principal Component Analysis (PCA), is a projection technique. PCA first identifies the hyperplane that lies closest to the data and then projects the data onto it. To do so, it finds the axis that accounts for the largest amount of variance in the training set, then a second axis, orthogonal to the first, that accounts for the largest remaining variance, and so on. Each axis is defined by a unit vector, and the unit vector that defines the ith axis is called the ith principal component.
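To make this concrete, here is a minimal sketch of how the principal components can be obtained, assuming NumPy and a small synthetic dataset (the mixing matrix below is made up purely for illustration). The rows of Vt from the SVD of the centered data are the unit vectors described above:

```python
import numpy as np

# Synthetic 3-feature dataset (the mixing matrix is arbitrary, just for demo).
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3)) @ np.array([[2.0, 0.5, 0.1],
                                          [0.5, 1.0, 0.2],
                                          [0.1, 0.2, 0.3]])

# Center the data, then take its singular value decomposition.
X_centered = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(X_centered, full_matrices=False)

c1 = Vt[0]  # 1st principal component: axis with the largest variance
c2 = Vt[1]  # 2nd principal component: orthogonal axis with the next largest variance
```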

Going to d dimensions

Once you have identified all of the principal components, you can reduce the dataset down to d dimensions by projecting it onto the hyperplane defined by the first d principal components. Choosing this hyperplane preserves as much variance as possible.
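In scikit-learn this projection is a one-liner. A minimal sketch, assuming a NumPy array X of training data and an arbitrarily chosen d:

```python
import numpy as np
from sklearn.decomposition import PCA

# Placeholder training data: 100 samples with 10 features.
X = np.random.default_rng(0).normal(size=(100, 10))

d = 3  # target number of dimensions (chosen arbitrarily here)
pca = PCA(n_components=d)
X_reduced = pca.fit_transform(X)  # projection onto the first d components
print(X_reduced.shape)            # -> (100, 3)
```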

Explained Variance Ratio

The explained_variance_ratio_ variable tells us the proportion of the dataset’s variance that lies along each principal component. For example, if 84% of the data’s variance lies along the first axis and 14% along the second, then we know that the remaining axes carry very little information (about 2% combined). Generally, you choose the number of dimensions whose explained variance adds up to a sufficiently large portion of the total, about 95%. For data visualization, however, you should reduce down to just 2 or 3 dimensions.
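A minimal sketch of this procedure, assuming scikit-learn and using the digits dataset purely as an example:

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X_train, _ = load_digits(return_X_y=True)

# Fit PCA without reducing dimensionality to inspect the variance per component.
pca = PCA().fit(X_train)
print(pca.explained_variance_ratio_[:3])  # variance along the first three axes

# Smallest number of dimensions whose cumulative variance reaches 95%.
cumsum = np.cumsum(pca.explained_variance_ratio_)
d = np.argmax(cumsum >= 0.95) + 1
print(d)

# Equivalently, scikit-learn lets you pass the target ratio directly:
pca_95 = PCA(n_components=0.95).fit(X_train)
```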
