In this tutorial, we will delve into the basics of Principal Component Analysis (PCA), a dimensionality reduction technique used in machine learning. PCA is widely used for data compression and feature extraction.
What is PCA?
PCA is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components.
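In matrix terms (a brief sketch using our own notation, since the text does not introduce symbols): if $X$ is the standardized $n \times d$ data matrix, PCA diagonalizes its covariance matrix and projects the data onto the leading eigenvectors:

$$\Sigma = \frac{1}{n-1} X^\top X, \qquad \Sigma w_i = \lambda_i w_i, \qquad Y = X W_k,$$

where the columns of $W_k$ are the $k$ eigenvectors with the largest eigenvalues $\lambda_i$, and $Y$ is the reduced $n \times k$ representation.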
Key Points of PCA:
- Dimensionality Reduction: PCA reduces the dimensionality of the data, making it easier to visualize and analyze.
- Feature Extraction: PCA constructs new features (the principal components) as linear combinations of the original variables, ordered by how much variance they capture.
- Data Compression: PCA can be used to compress the data without losing much information.
PCA Workflow
- Standardize the Data: The first step in PCA is to standardize the data: subtract the mean and divide by the standard deviation for each feature, so that features measured on different scales contribute equally.
- Compute the Covariance Matrix: The next step is to compute the covariance matrix of the standardized data.
- Compute Eigenvectors and Eigenvalues: The eigenvectors and eigenvalues of the covariance matrix are then computed; the eigenvectors give the directions of the principal components, and each eigenvalue gives the variance along its eigenvector.
- Sort Eigenvectors by Eigenvalues: The eigenvectors are sorted in descending order of their corresponding eigenvalues.
- Project Data onto Principal Components: Finally, the data is projected onto the first few principal components.
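In practice, these five steps are usually handled by a library call. The following is a minimal sketch of the same workflow using scikit-learn (assumed to be installed; it is not used elsewhere in this tutorial):

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
# Example data: 100 samples, 3 features
data = np.random.randn(100, 3)
# Step 1: standardize
standardized = StandardScaler().fit_transform(data)
# Steps 2-5 happen inside fit_transform (scikit-learn uses an SVD, which yields the same components)
pca = PCA(n_components=2)
reduced = pca.fit_transform(standardized)
print(reduced.shape)                   # (100, 2)
print(pca.explained_variance_ratio_)   # share of variance kept by each component

Note that scikit-learn's PCA centers the data but does not scale it, which is why the StandardScaler step is kept explicit here.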
Example
Let's say we have a dataset with three features: X, Y, and Z. We want to reduce the dimensionality of this dataset using PCA.
import numpy as np
# Generate some random data
data = np.random.randn(100, 3)
# Standardize the data
mean = np.mean(data, axis=0)
std = np.std(data, axis=0)
standardized_data = (data - mean) / std
# Compute the covariance matrix
cov_matrix = np.cov(standardized_data, rowvar=False)
# Compute eigenvectors and eigenvalues
eigenvalues, eigenvectors = np.linalg.eigh(cov_matrix)
# Sort eigenvectors by eigenvalues
sorted_indices = np.argsort(eigenvalues)[::-1]
sorted_eigenvectors = eigenvectors[:, sorted_indices]
# Project data onto the first two principal components
pca_data = np.dot(standardized_data, sorted_eigenvectors[:, :2])
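As a small continuation of the example above (reusing the variables already defined; explained_variance_ratio, reconstructed, and reconstruction_error are new names introduced here), the sorted eigenvalues show how much variance the two retained components keep, and projecting back illustrates the compression idea:

# Fraction of the total variance captured by each component
sorted_eigenvalues = eigenvalues[sorted_indices]
explained_variance_ratio = sorted_eigenvalues / np.sum(sorted_eigenvalues)
print(explained_variance_ratio)
print(explained_variance_ratio[:2].sum())  # variance retained by the 2-D projection
# Approximate reconstruction of the standardized data from the 2-D projection
reconstructed = pca_data @ sorted_eigenvectors[:, :2].T
reconstruction_error = np.mean((standardized_data - reconstructed) ** 2)
print(reconstruction_error)

For random, uncorrelated data like this, the three ratios will be roughly equal; on real, correlated data the first one or two components typically dominate, which is what makes the compression worthwhile.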