Welcome to the SciPy Hierarchical Clustering guide! This tutorial walks you through the fundamentals of hierarchical clustering using the scipy library in Python.
🧠 What is Hierarchical Clustering?
Hierarchical clustering is a method of grouping data points into clusters based on their similarity. Unlike k-means, it does not need the number of clusters up front: it builds a tree-like structure (dendrogram) that represents the hierarchy of clusters, and you choose where to cut it afterwards.
Key Concepts:
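- Bottom-up (agglomerative) approach: Starts with every data point as its own cluster and iteratively merges the two closest clusters. This is the approach scipy.cluster.hierarchy implements.
- Top-down (divisive) approach: Starts with all points in one cluster and splits clusters recursively.
- Linkage criterion: Determines how the distance between two clusters is measured, and therefore which clusters are merged next (e.g., single, complete, average).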
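To make the linkage criteria concrete, here is a minimal sketch that computes the distance between two tiny clusters under single, complete, and average linkage. The points in cluster_a and cluster_b are made-up illustration data, not taken from the examples in this guide.

```python
import numpy as np
from scipy.spatial.distance import cdist

# Two small illustrative clusters on a line (made-up data).
cluster_a = np.array([[0.0, 0.0], [1.0, 0.0]])
cluster_b = np.array([[3.0, 0.0], [5.0, 0.0]])

# Euclidean distance between every point in A and every point in B.
pairwise = cdist(cluster_a, cluster_b)

single_link = pairwise.min()     # distance of the closest pair
complete_link = pairwise.max()   # distance of the farthest pair
average_link = pairwise.mean()   # mean distance over all pairs

print(single_link, complete_link, average_link)  # 2.0 5.0 3.5
```

The same pair of clusters is 2.0, 5.0, or 3.5 apart depending on the criterion, which is why different linkage methods can produce different trees on the same data.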
🚀 Getting Started with SciPy
Here’s how to perform hierarchical clustering using scipy.cluster.hierarchy:
Import Libraries
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import pdist
import matplotlib.pyplot as plt
Prepare Data
Use a sample dataset or your own data. For example:
data = np.array([[1, 2], [1, 4], [1, 0], [4, 2], [4, 4], [4, 0]])
Compute Linkage
Z = linkage(data, method='ward') # 'ward', 'single', 'complete', 'average'
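It can help to look at what linkage returns before plotting it. The sketch below reuses the six-point sample array and prints each merge step. Each row of Z holds the two cluster indices being merged, the merge distance, and the size of the new cluster. Indices below n refer to original points, and the cluster formed at step i gets index n + i.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

data = np.array([[1, 2], [1, 4], [1, 0], [4, 2], [4, 4], [4, 0]])
Z = linkage(data, method='ward')

n = data.shape[0]
print(Z.shape)  # (n - 1, 4): one row per merge

for step, (a, b, dist, size) in enumerate(Z):
    # a and b are stored as floats, so cast them for display.
    print(f"step {step}: merge {int(a)} + {int(b)} "
          f"at distance {dist:.3f} -> cluster {n + step} ({int(size)} points)")
```

The last row always has size n, because the final merge joins everything into a single cluster.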
Visualize Dendrogram
```python
plt.figure(figsize=(10, 7))
dendrogram(Z)
plt.title("Dendrogram")
plt.show()
```
Cut the Tree
Use fcluster to cut the tree into a chosen number of clusters (with criterion='maxclust', t is the maximum number of clusters):
from scipy.cluster.hierarchy import fcluster
clusters = fcluster(Z, t=3, criterion='maxclust')
print("Cluster labels:", clusters)
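As an alternative to fixing the number of clusters, fcluster can also cut the tree at a height with criterion='distance'. The sketch below is only an illustration: it swaps in single linkage (an assumption, chosen because the merge distances on this sample array are easy to read off) and uses a threshold of 2.5, which is also just an example value.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

data = np.array([[1, 2], [1, 4], [1, 0], [4, 2], [4, 4], [4, 0]])

# Single linkage: neighbours within each column are 2 apart,
# and the two columns (x=1 and x=4) are 3 apart.
Z = linkage(data, method='single')

# Keep together only points whose cophenetic distance is <= 2.5.
labels = fcluster(Z, t=2.5, criterion='distance')
print("Cluster labels:", labels)
print("Number of clusters:", len(set(labels)))  # 2
```

A threshold between 2 and 3 separates the two columns of points. A threshold of 3 or more would merge everything into one cluster.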
📌 Practical Example
Let’s cluster a 2D dataset:
from sklearn.datasets import make_blobs
X, y = make_blobs(n_samples=30, centers=4, random_state=42)
# Compute linkage
Z = linkage(X, method='complete')
# Plot dendrogram
plt.figure(figsize=(10, 7))
dendrogram(Z)
plt.title("Dendrogram for Complete Linkage")
plt.show()
# Cut tree into 4 clusters
clusters = fcluster(Z, t=4, criterion='maxclust')
print("Cluster assignments:", clusters)
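To inspect the result of a cut, it is often enough to count how many points landed in each cluster. The following sketch does not depend on scikit-learn: it builds its own synthetic blobs with NumPy (the centers, noise scale, and seed are arbitrary assumptions), then summarizes the fcluster labels. Note that fcluster labels start at 1, not 0.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Synthetic stand-in for make_blobs: 30 points around 4 well-separated centers.
rng = np.random.default_rng(42)
centers = np.array([[0, 0], [10, 0], [0, 10], [10, 10]])
X = np.repeat(centers, [8, 8, 7, 7], axis=0) + rng.normal(scale=0.5, size=(30, 2))

Z = linkage(X, method='complete')
labels = fcluster(Z, t=4, criterion='maxclust')

# Count the points assigned to each cluster label.
ids, counts = np.unique(labels, return_counts=True)
for cluster_id, count in zip(ids, counts):
    print(f"cluster {cluster_id}: {count} points")
```

Very uneven counts, or clusters holding a single point, are a hint to try another linkage method or a different cut.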
🧪 Tips & Tricks
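- Use pdist to compute distances with a custom metric, then pass the condensed result to linkage.
- Adjust the method parameter to try different linkage strategies.
- Explore the other fcluster criteria (for example 'distance' or 'inconsistent') to choose a partition.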
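The first two tips can be combined in a short sketch: compute a condensed distance vector with pdist using a non-Euclidean metric, pass it to linkage, and compare a few methods. The random data, the 'cityblock' metric, and the use of cophenet as a rough quality score are all assumptions made for illustration. The loop is limited to single, complete, and average because SciPy documents 'ward', 'centroid', and 'median' as correctly defined only for Euclidean distances.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import pdist

# Arbitrary illustration data: 20 points in 3 dimensions.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))

# Condensed pairwise distances with a custom (Manhattan) metric.
dists = pdist(X, metric='cityblock')

for method in ('single', 'complete', 'average'):
    # linkage accepts the condensed distance vector directly.
    Z = linkage(dists, method=method)
    # Cophenetic correlation: how well the tree preserves the original distances.
    c, _ = cophenet(Z, dists)
    print(f"{method}: cophenetic correlation = {c:.3f}")
```

A higher cophenetic correlation means the dendrogram distorts the original distances less, but it is only one heuristic and does not by itself say which clustering is most useful.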