The Cure for the Curse of Dimensionality: PCA
What Do We Mean by “Curse of Dimensionality”?
In statistics, dimensionality is the number of features a dataset has. [1]
Three features (e.g., height, age, favorite number) → a 3D space; each person is a point. Visualizing is fine in 2D/3D, but quickly becomes impractical in 10D+. As dimensionality grows:
- Distances concentrate and neighborhoods become sparse.
- You need far more samples to cover the space.
- Models get slower and more brittle.
 
In genomics or other ultra-wide datasets, this “curse” is real. We often fight it with dimensionality reduction—compressing data into fewer, more informative dimensions while preserving structure. Classic tools include PCA, LDA, t-SNE, and autoencoders. [2–4]
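To make the first point above concrete, here is a minimal sketch (not from the original article; it uses 100 random Gaussian points per dimension count) showing how the gap between the nearest and farthest pairwise distances shrinks as the number of dimensions grows:
import numpy as np

rng = np.random.default_rng(0)
for d in (2, 10, 100, 1000):
    P = rng.standard_normal((100, d))            # 100 random points in d dimensions
    diffs = P[:, None, :] - P[None, :, :]        # pairwise difference vectors
    dists = np.linalg.norm(diffs, axis=-1)       # pairwise Euclidean distances
    dists = dists[np.triu_indices(100, k=1)]     # keep each pair only once
    print(f"d={d:4d}  farthest/nearest distance ratio = {dists.max() / dists.min():.2f}")
As d grows, the printed ratio drops toward 1: every point becomes roughly equidistant from every other point, which is exactly why nearest-neighbor reasoning degrades in high dimensions.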
PCA in One Paragraph
Principal Component Analysis (PCA) finds a new orthogonal basis for your data such that the first axis captures the largest variance, the second the next largest (subject to orthogonality), and so on. You can then project your data onto the first k components and work in k dimensions with minimal information loss (variance-wise). [5]
Intuition: if height barely varies across your samples, or just mirrors another feature, PCA can down-weight or remove that direction. Note that PCA is unsupervised: it looks only at variance, not at whatever you ultimately want to predict.
A Minimal scikit-learn Example
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import numpy as np
# X: shape (n_samples, n_features)
X = np.array([
    [1.8, 25, 70],
    [1.6, 30, 60],
    [1.9, 22, 75],
    [1.7, 28, 68],
    [1.8, 26, 72],
], dtype=float)
# 1) Standardize (important for PCA)
X_std = StandardScaler().fit_transform(X)
# 2) Fit PCA to keep top-2 components
pca = PCA(n_components=2, random_state=0)
Z = pca.fit_transform(X_std)
print("Explained variance ratio:", pca.explained_variance_ratio_)  # how much each PC explains
print("Components (rows are PCs):\n", pca.components_)             # eigenvectors in feature space
print("Projected data shape:", Z.shape)                            # (n_samples, 2)
Tip: Always scale features first (e.g., with StandardScaler); otherwise features with large units dominate the covariance. [6, 7]
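As a quick illustration (reusing the toy X array and imports from the snippet above): in raw units the height column has a tiny variance and contributes almost nothing to the leading component, while after standardization all three features sit on an equal footing, so the explained-variance ratios shift noticeably.
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

pca_raw = PCA(n_components=2).fit(X)                                  # raw, unscaled features
pca_std = PCA(n_components=2).fit(StandardScaler().fit_transform(X))  # standardized features
print("Explained variance ratio (raw):         ", pca_raw.explained_variance_ratio_)
print("Explained variance ratio (standardized):", pca_std.explained_variance_ratio_)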
How PCA Works — Step by Step
1) Standardize & Center the Data
Let X be an n x d matrix (rows = samples, columns = features).
Subtract the mean of each column so the data is zero-centered (and, when features sit on very different scales, also divide each column by its standard deviation).
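A minimal NumPy sketch of this step, using the same made-up X as the scikit-learn example above:
import numpy as np

X = np.array([[1.8, 25, 70],
              [1.6, 30, 60],
              [1.9, 22, 75],
              [1.7, 28, 68],
              [1.8, 26, 72]], dtype=float)   # rows = samples, columns = features

Xc = X - X.mean(axis=0)                       # subtract each column's mean
print(Xc.mean(axis=0))                        # ~[0, 0, 0]: data is now centered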
2) Compute the Covariance Matrix
Use: Σ = (1 / (n - 1)) * X^T * X (with X already centered, as in step 1).
Σ is d x d:
- Diagonals = feature variances
- Off-diagonals = covariances [6, 7]
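Continuing the sketch from step 1 (this assumes X and Xc are still defined), the covariance matrix can be built directly and checked against NumPy's np.cov:
n = Xc.shape[0]
Sigma = (Xc.T @ Xc) / (n - 1)                       # d x d covariance matrix
print(Sigma)
print(np.allclose(Sigma, np.cov(X, rowvar=False)))  # True: matches np.cov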
 
3) Eigenvalues & Eigenvectors
Solve the eigenproblem for Σ to obtain:
- Eigenvalues (λ): how much variance is captured along each direction
- Eigenvectors (v): the principal directions
Normalize eigenvectors to unit length. [8–10] 
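Continuing the sketch (this assumes Sigma from step 2): since the covariance matrix is symmetric, np.linalg.eigh is the appropriate solver, and it already returns unit-length eigenvectors, so no extra normalization is needed.
eigvals, eigvecs = np.linalg.eigh(Sigma)        # eigh: eigendecomposition for symmetric matrices
print("Eigenvalues:", eigvals)                  # variance captured by each direction
print("Eigenvectors (as columns):\n", eigvecs)  # principal directions, unit length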
4) Rank & Select Components
- Sort eigenpairs by descending eigenvalue (λ).
- Keep the top k eigenvectors.
- Stack them (as rows) into a projection matrix W_k.
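Continuing the sketch (this assumes eigvals and eigvecs from step 3): np.linalg.eigh returns eigenvalues in ascending order, so sort them in descending order first, then keep the top k = 2 eigenvectors as the rows of W_k.
k = 2
order = np.argsort(eigvals)[::-1]               # eigenvalue indices, largest first
W_k = eigvecs[:, order[:k]].T                   # shape (k, d): rows are the top-k PCs
print("W_k shape:", W_k.shape)                  # (2, 3)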
5) Project the Data
Transform centered data:
Z = X * W_k^T
Z is the new low-dimensional representation.
- The first k PCs retain the maximum variance achievable by any k-dimensional linear projection.
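Finishing the sketch (this assumes Xc from step 1 and W_k from step 4): project the centered data and, as a sanity check, compare against scikit-learn's PCA on the same unscaled data; the two agree up to the signs of the components.
from sklearn.decomposition import PCA

Z = Xc @ W_k.T                                     # shape (n_samples, k)
Z_sklearn = PCA(n_components=2).fit_transform(X)   # sklearn centers internally
print("Z shape:", Z.shape)                         # (5, 2)
print(np.allclose(np.abs(Z), np.abs(Z_sklearn)))   # True (component signs may flip)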
References
[1] Dimensionality, StatisticsHowTo
[2] Application of PCA to medical data, Indian Journal of Science and Technology
[3] Gülsan Öğündür, PCA (Turkish overview)
[4] Jolliffe/Jackson, A User’s Guide to Principal Components (excerpt)
[5] Wikipedia — Principal Component Analysis (overview)
[6] 5 Things You Should Know About Covariance, Towards Data Science
[7] R Statistics Cookbook (O’Reilly) — covariance section
[8] Wikipedia — Eigenvalues and Eigenvectors
[9] Math StackExchange: Importance of eigenvalues/eigenvectors
[10] MathsIsFun — Eigenvalues and Eigenvectors