Welcome to the SciPy Hierarchical Clustering guide! This tutorial will walk you through the fundamentals of hierarchical clustering using the scipy library in Python.


🧠 What is Hierarchical Clustering?

Hierarchical clustering groups data points into nested clusters based on their similarity. Unlike k-means, it does not need the number of clusters up front; it builds a tree-like structure (a dendrogram) that records the order and the distance at which clusters are merged.

Key Concepts:

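  • Bottom-up (agglomerative) approach: Starts with each data point as its own cluster and repeatedly merges the two closest clusters. This is the approach scipy.cluster.hierarchy implements.
  • Top-down (divisive) approach: Starts with one cluster containing all points and splits it recursively.
  • Linkage criteria: Define how the distance between two clusters is measured, which determines the next pair to merge (e.g., single, complete, average, ward).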
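
To make the linkage criteria concrete, here is a minimal sketch that clusters a tiny made-up dataset with three methods and prints the height of the final merge. The sample points and variable names are assumptions chosen for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Hypothetical toy data: two tight groups on a line
points = np.array([[0.0], [1.0], [5.0], [6.0]])

final_heights = {}
for method in ("single", "complete", "average"):
    Z = linkage(points, method=method)
    # Column 2 of the last row is the distance at which the last two clusters merge
    final_heights[method] = float(Z[-1, 2])
    print(method, final_heights[method])
```

Single linkage measures the closest pair of points between two clusters, complete linkage the farthest pair, and average linkage the mean over all pairs, so the final merge height here is smallest for single and largest for complete.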

🚀 Getting Started with SciPy

Here’s how to perform hierarchical clustering using scipy.cluster.hierarchy:

  1. Import Libraries

    import numpy as np  
    from scipy.cluster.hierarchy import dendrogram, linkage, fcluster
    from scipy.spatial.distance import pdist  
    import matplotlib.pyplot as plt  
    
  2. Prepare Data
    Use a sample dataset or your own data. For example:

    data = np.array([[1, 2], [1, 4], [1, 0], [4, 2], [4, 4], [4, 0]])  
    
  3. Compute Linkage

    Z = linkage(data, method='ward')  # other options include 'single', 'complete', 'average'
    
  4. Visualize Dendrogram

    ```python
    plt.figure(figsize=(10, 7))
    dendrogram(Z)
    plt.title("Dendrogram")
    plt.show()
    ```
  5. Cut the Tree
    Use fcluster to cut the tree into flat clusters. With criterion='maxclust', t is the maximum number of clusters to form:

    from scipy.cluster.hierarchy import fcluster  
    clusters = fcluster(Z, t=3, criterion='maxclust')  
    print("Cluster labels:", clusters)  
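
The linkage matrix from step 3 and the cut from step 5 can also be inspected directly. The sketch below reuses the sample data from step 2, shows the layout of Z, and cuts the tree at a distance threshold instead of a cluster count. The threshold 2.5 is an assumption picked for this particular dataset.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

data = np.array([[1, 2], [1, 4], [1, 0], [4, 2], [4, 4], [4, 0]])
Z = linkage(data, method="ward")

# Z has one row per merge: n - 1 rows for n observations.
# Columns: cluster index A, cluster index B, merge distance, size of the new cluster.
print(Z.shape)  # (5, 4)

# Keep only merges at or below the threshold (hypothetical value for this data)
labels = fcluster(Z, t=2.5, criterion="distance")
print("Cluster labels:", labels)
print("Number of clusters:", len(set(labels)))
```

A distance cut is useful when the dendrogram shows a clear gap between merge heights, because the number of clusters then follows from the data instead of being fixed in advance.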
    

📌 Practical Example

Let’s cluster a 2D dataset:

    import matplotlib.pyplot as plt
    from scipy.cluster.hierarchy import dendrogram, fcluster, linkage
    from sklearn.datasets import make_blobs

    # Generate 30 two-dimensional points around 4 centers
    X, y = make_blobs(n_samples=30, centers=4, random_state=42)

    # Compute linkage
    Z = linkage(X, method='complete')

    # Plot dendrogram
    plt.figure(figsize=(10, 7))
    dendrogram(Z)
    plt.title("Dendrogram for Complete Linkage")
    plt.show()

    # Cut tree into at most 4 clusters
    clusters = fcluster(Z, t=4, criterion='maxclust')
    print("Cluster assignments:", clusters)
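
A quick sanity check after cutting the tree is to count how many points landed in each cluster. The sketch below is self-contained and uses a NumPy-generated stand-in for X (an assumption, so that it runs without scikit-learn); the centers and spread are made up for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Synthetic stand-in for X (assumption): 4 well-separated groups of 8 points each
rng = np.random.default_rng(42)
centers = np.array([[0.0, 0.0], [20.0, 0.0], [0.0, 20.0], [20.0, 20.0]])
X = np.vstack([c + rng.normal(scale=1.0, size=(8, 2)) for c in centers])

Z = linkage(X, method="complete")
clusters = fcluster(Z, t=4, criterion="maxclust")

# fcluster labels start at 1; count the points assigned to each label
labels, sizes = np.unique(clusters, return_counts=True)
for label, size in zip(labels, sizes):
    print(f"cluster {label}: {size} points")
```

Very uneven cluster sizes can be a hint that the linkage method or the cut does not suit the data.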

🧪 Tips & Tricks

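  • Use pdist to compute pairwise distances with a custom metric, then pass the result to linkage.
  • Adjust the method parameter of linkage to try different linkage strategies.
  • Compare partitions by calling fcluster with different criterion values (e.g., 'maxclust' or 'distance').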
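
The pdist tip can be sketched as follows, reusing the sample data from step 2. The choice of the 'cityblock' metric and 'average' linkage is an assumption for illustration. The SciPy documentation notes that 'ward', 'centroid', and 'median' are only correctly defined for Euclidean distances, so a different method is used here.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

data = np.array([[1, 2], [1, 4], [1, 0], [4, 2], [4, 4], [4, 0]])

# Condensed distance matrix: one entry per pair, n * (n - 1) / 2 values
distances = pdist(data, metric="cityblock")
print(distances.shape)  # (15,)

# linkage accepts the condensed matrix in place of the raw observations
Z = linkage(distances, method="average")
labels = fcluster(Z, t=2, criterion="maxclust")
print("Cluster labels:", labels)
```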
