PCA for Dimensionality Reduction

Intermediate 10 min read

What you'll learn

✓ What principal components actually are
✓ How variance relates to information
✓ When PCA helps and when it hurts
✓ How to choose the number of components
✓ Common preprocessing mistakes

Prerequisites

• Basic linear algebra
• Familiarity with feature vectors

High-dimensional data is hard to model, hard to visualize, and hard to reason about. Principal Component Analysis is the classic answer: rotate your features into a new coordinate system where most of the variation lives along just a few axes. This post explains how PCA works and how to use it without quietly destroying your signal.

What it is and why use it

PCA is an unsupervised linear transformation that finds new orthogonal axes, called principal components, ordered by how much variance in the data they capture. The first component points in the direction of maximum spread, the second in the next best direction perpendicular to the first, and so on.

Reasons to use PCA include compressing data for storage, speeding up downstream models, visualizing high-dimensional structure in two or three dimensions, and reducing multicollinearity before fitting linear models. It is also a useful diagnostic: if the first few components account for most of the variance, the data lies close to a low-dimensional linear subspace.

Mental model

Imagine a cloud of points shaped like a stretched cigar in three dimensions. The long axis of the cigar carries most of the information about where points sit. The two short axes barely matter. PCA finds that long axis automatically, then the next-most-spread axis, and lets you project onto whichever subset you keep.

Mathematically, PCA computes the eigenvectors of the data covariance matrix. The eigenvectors are the directions, and the eigenvalues tell you how much variance each direction captures. You can also get the same answer from the singular value decomposition of the centered data matrix, which is what most libraries do for stability.
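To make the equivalence concrete, here is a minimal NumPy sketch on synthetic data (the array shapes and random seed are arbitrary assumptions). It computes the components both ways and checks that they agree, up to the sign of each direction.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data: 200 samples, 5 correlated features (illustrative only).
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))

# Center each feature; PCA is defined on mean-centered data.
Xc = X - X.mean(axis=0)
n = Xc.shape[0]

# Route 1: eigendecomposition of the covariance matrix.
cov = Xc.T @ Xc / (n - 1)
eigvals, eigvecs = np.linalg.eigh(cov)        # eigh returns ascending order
order = np.argsort(eigvals)[::-1]             # re-sort to descending variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Route 2: SVD of the centered data matrix.
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
svd_vars = S**2 / (n - 1)                     # singular values -> variances

# Same variances; same directions up to a sign flip per component.
variances_match = np.allclose(eigvals, svd_vars)
directions_match = np.allclose(np.abs(eigvecs.T @ Vt.T), np.eye(5), atol=1e-6)
print(variances_match, directions_match)
```

The rows of Vt play the role of the eigenvectors, and the squared singular values divided by n - 1 play the role of the eigenvalues.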

Hands-on example

Suppose you have a thousand customers described by fifty behavioral features. You want a two-dimensional plot to look for clusters. You center and scale the data, compute the SVD, take the top two right singular vectors, and project. The diagram shows the equivalent covariance route.

raw features (50 dims)
    |
 center & scale
    |
    v
 covariance matrix  -> eigendecomposition
                          |
                          v
                 sort eigenvectors by eigenvalue
                          |
                          v
          keep top 2 -> project data -> 2D scatter

PCA projection from 50D to 2D

In scikit-learn this takes two steps: StandardScaler, then PCA with n_components=2. Plot the result and you may see clusters that were hard to spot in the original space. The explained_variance_ratio_ attribute gives the fraction of the standardized data's variance captured by each kept component; sum it to see how much you kept overall.
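A sketch of that workflow, assuming random placeholder data in place of the real customer table (the variable names are illustrative, not from any particular codebase):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
# Placeholder for the real data: 1000 customers, 50 features.
X = rng.normal(size=(1000, 50))

# Standardize, then project onto the top two principal components.
pipeline = make_pipeline(StandardScaler(), PCA(n_components=2))
X_2d = pipeline.fit_transform(X)

# make_pipeline names each step after its lowercased class name.
pca = pipeline.named_steps["pca"]
kept = pca.explained_variance_ratio_.sum()
print(X_2d.shape, f"variance kept: {kept:.1%}")
```

X_2d can go straight into a scatter plot. On pure noise like this placeholder the kept fraction will be small; on real data with correlated features it is usually much higher.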

Trade-offs

PCA is linear. If the structure in your data lies along a curved manifold, principal components will smear it across many dimensions. Methods like t-SNE, UMAP, or kernel PCA handle nonlinear structure better, at higher computational cost.

It also assumes that variance equals information. A low-variance feature or direction with a strong relationship to your target will be discarded along with the noise. PCA is unsupervised, so it has no idea what you actually care about predicting.
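A small synthetic sketch of this failure mode (every name and number here is an illustrative assumption): two features share a large common component unrelated to the target, and the target hides in their small difference.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 2000
y = rng.integers(0, 2, size=n)        # binary target
shared = rng.normal(size=n)           # large shared variation, unrelated to y
signal = 0.1 * (2 * y - 1)            # small offset that encodes y

# Two features that mostly move together; y lives in their difference.
X = np.column_stack([shared + signal, shared - signal])

pca = PCA(n_components=2)
scores = pca.fit_transform(StandardScaler().fit_transform(X))

# Absolute correlation of each component's scores with the target.
corr_pc1 = abs(np.corrcoef(scores[:, 0], y)[0, 1])
corr_pc2 = abs(np.corrcoef(scores[:, 1], y)[0, 1])
print(pca.explained_variance_ratio_, corr_pc1, corr_pc2)
```

In this setup the first component carries nearly all the variance but almost nothing about y, while the second component, which a variance threshold would drop, tracks y closely.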

Interpretation suffers. A principal component is a weighted sum of original features, often with no clean meaning. If stakeholders need to understand individual feature contributions, PCA can make life harder, not easier.

Practical tips

Standardize features before PCA whenever they are measured in different units or on different scales. Otherwise a feature measured in dollars will dominate over one measured in counts purely because of scale.
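A quick sketch of the scale effect, assuming two hypothetical features (yearly spend in dollars and a visit count) generated at random:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
# Hypothetical features on very different scales.
spend = rng.normal(50_000, 15_000, size=500)          # dollars
visits = rng.poisson(5, size=500) + spend / 25_000    # counts, mildly related
X = np.column_stack([spend, visits])

# First-component loadings without and with standardization.
raw_pc1 = PCA(n_components=1).fit(X).components_[0]
scaled_pc1 = PCA(n_components=1).fit(
    StandardScaler().fit_transform(X)
).components_[0]

print("raw loadings:   ", np.round(raw_pc1, 4))
print("scaled loadings:", np.round(scaled_pc1, 4))
```

On the raw data the first component is almost entirely the dollar feature. After standardization both features contribute; with exactly two correlated standardized features the loadings have equal magnitude.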

Decide on the number of components using a scree plot or a cumulative explained variance curve. A common rule is to keep enough components to retain ninety to ninety-five percent of variance.
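One way to apply the cumulative-variance rule, sketched on synthetic data with a few latent factors (the 0.95 threshold and data shapes are assumptions chosen for illustration):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(7)
# Synthetic data: 5 latent factors spread over 30 features, plus noise.
latent = rng.normal(size=(500, 5))
X = latent @ rng.normal(size=(5, 30)) + 0.3 * rng.normal(size=(500, 30))
Z = StandardScaler().fit_transform(X)

# Fit all components, then find the smallest k reaching the threshold.
pca = PCA().fit(Z)
cumulative = np.cumsum(pca.explained_variance_ratio_)
k = int(np.searchsorted(cumulative, 0.95) + 1)
print(f"{k} components retain {cumulative[k - 1]:.1%} of the variance")

# scikit-learn can also pick k itself when n_components is a float in (0, 1).
pca_95 = PCA(n_components=0.95).fit(Z)
print(pca_95.n_components_)
```

Plotting cumulative against the component index gives the cumulative explained variance curve; plotting explained_variance_ratio_ itself gives the scree plot.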

Fit the scaler and PCA only on the training set, then apply the fitted transforms to both train and test. Fitting on the full dataset leaks information about the test set and can inflate evaluation metrics.
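A minimal sketch of the leak-free pattern, assuming random placeholder data and an arbitrary choice of five components:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
X = rng.normal(size=(400, 20))   # placeholder feature matrix
X_train, X_test = train_test_split(X, test_size=0.25, random_state=0)

reducer = make_pipeline(StandardScaler(), PCA(n_components=5))

# Learn means, scales, and components from the training rows only.
X_train_red = reducer.fit_transform(X_train)
# Reuse those fitted parameters on the test rows; no refitting.
X_test_red = reducer.transform(X_test)

print(X_train_red.shape, X_test_red.shape)
```

Wrapping both steps in one pipeline also keeps cross-validation honest, since each fold refits the scaler and PCA on its own training split.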

Try PCA before clustering or before training distance-based models. These methods often benefit from a compact, denoised feature space.

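For very large datasets, use IncrementalPCA when the data does not fit in memory, or a randomized SVD solver when you only need the leading components. A full decomposition becomes expensive as the feature count grows into the thousands.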
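Both options are available in scikit-learn. The sketch below uses a small random array as a stand-in for a large dataset; the batch count and component count are arbitrary assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA, IncrementalPCA

rng = np.random.default_rng(5)
X = rng.normal(size=(1000, 100))   # stand-in for a much larger dataset

# Option 1: feed the data in batches so it never has to sit in memory at once.
ipca = IncrementalPCA(n_components=10)
for batch in np.array_split(X, 5):   # each batch must have >= n_components rows
    ipca.partial_fit(batch)
X_inc = ipca.transform(X)

# Option 2: approximate only the leading components with randomized SVD.
rpca = PCA(n_components=10, svd_solver="randomized", random_state=0)
X_rand = rpca.fit_transform(X)

print(X_inc.shape, X_rand.shape)
```

In practice the batches for IncrementalPCA would come from disk or a data loader rather than from an in-memory array.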

Wrap-up

PCA is the workhorse of dimensionality reduction. It is fast, well understood, and often enough. Use it for visualization, for preprocessing, and as a sanity check on the intrinsic dimensionality of your data. When linear projection is not enough, you will know, because the variance curve will refuse to flatten and clusters will refuse to separate.