Outlier detection is a critical step in data preprocessing. Here are common methods and their implementation in Python:
1. Z-Score Method 📈
import numpy as np
from scipy import stats
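# Generate sample data: 1,000 draws from a standard normal distribution
data = np.random.normal(0, 1, 1000)
# Flag points that lie more than 3 standard deviations from the mean
outliers = np.abs(stats.zscore(data)) > 3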
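To make the rule behind the z-score call explicit, here is a minimal sketch that computes the scores by hand with NumPy. The helper name `zscore_outliers`, the seed, and the planted value 12.0 are illustrative assumptions, not part of the original example.

```python
import numpy as np

def zscore_outliers(values, threshold=3.0):
    """Return a boolean mask marking points whose |z-score| exceeds threshold."""
    values = np.asarray(values, dtype=float)
    std = values.std()  # population standard deviation (ddof=0), matching scipy.stats.zscore's default
    if std == 0:
        # All values are identical, so nothing can be an outlier
        return np.zeros(values.shape, dtype=bool)
    z = (values - values.mean()) / std
    return np.abs(z) > threshold

rng = np.random.default_rng(0)
sample = np.append(rng.normal(0, 1, 200), 12.0)  # planted extreme value (assumption for the demo)
mask = zscore_outliers(sample)
print(sample[mask])
```

Because the mean and standard deviation are themselves pulled by extreme values, this rule tends to work best when outliers are rare and the data is roughly normal.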
2. IQR Method ✅
Q1 = np.percentile(data, 25)
Q3 = np.percentile(data, 75)
IQR = Q3 - Q1
# Flag points more than 1.5 * IQR below Q1 or above Q3
outliers = (data < (Q1 - 1.5 * IQR)) | (data > (Q3 + 1.5 * IQR))
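A small worked example can make the fences concrete. The sketch below wraps the same arithmetic in a hypothetical helper, `iqr_bounds`, and applies it to a made-up ten-value array; both are assumptions for illustration.

```python
import numpy as np

def iqr_bounds(values, k=1.5):
    """Return the (lower, upper) fences Q1 - k*IQR and Q3 + k*IQR."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Made-up readings with one obviously extreme value
values = np.array([10, 12, 12, 13, 12, 11, 14, 13, 15, 102])
lower, upper = iqr_bounds(values)
flagged = values[(values < lower) | (values > upper)]
print(lower, upper, flagged)  # fences are 9.375 and 16.375; only 102 falls outside
```

Unlike the z-score, the quartiles barely move when a single extreme value is added, which is why this rule is often preferred for skewed or heavy-tailed data.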
3. DBSCAN Clustering 🌐
from sklearn.cluster import DBSCAN
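# eps is a distance in the data's own units, so tune it (or scale the data) for your dataset
dbscan = DBSCAN(eps=0.5, min_samples=5)
clusters = dbscan.fit_predict(data.reshape(-1, 1))
# DBSCAN assigns the label -1 to noise points that belong to no dense cluster
outliers = clusters == -1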
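As a rough illustration of how the noise label behaves, the sketch below builds two tight groups and one isolated point. The cluster centres, spreads, seed, and the isolated value 2.5 are all assumptions chosen for the demo.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
cluster_a = rng.normal(0.0, 0.1, 50)
cluster_b = rng.normal(5.0, 0.1, 50)
# 2.5 sits far from both groups, so it has no neighbours within eps
values = np.concatenate([cluster_a, cluster_b, [2.5]])

labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(values.reshape(-1, 1))
noise_mask = labels == -1
print(values[noise_mask])
```

With a much larger `eps` the isolated point would be absorbed into a cluster, and with a much smaller one many ordinary points would be labelled noise, so the result depends heavily on that parameter.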
4. Isolation Forest 🌲
from sklearn.ensemble import IsolationForest
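# contamination is the expected share of outliers; 0.1 flags roughly 10% of the points
iso_forest = IsolationForest(contamination=0.1)
# fit_predict returns -1 for outliers and 1 for inliers
outliers = iso_forest.fit_predict(data.reshape(-1, 1)) == -1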
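The `contamination` setting largely decides how many points get flagged, whether or not they are genuinely unusual. The sketch below compares two settings on the same data; the helper name `flag`, the seed, and the planted value 15.0 are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
# 999 ordinary draws plus one planted extreme value
values = np.append(rng.normal(0, 1, 999), 15.0).reshape(-1, 1)

def flag(values, contamination):
    """Fit an Isolation Forest and return a boolean outlier mask."""
    forest = IsolationForest(contamination=contamination, random_state=0)
    return forest.fit_predict(values) == -1

loose = flag(values, 0.10)   # flags about 10% of the points
strict = flag(values, 0.01)  # flags about 1% of the points
print(loose.sum(), strict.sum())
```

If the true outlier rate is unknown, it may be safer to inspect the continuous scores from `score_samples` and pick a cutoff by eye than to trust a fixed contamination value.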
5. PCA-Based Detection 🔄
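from sklearn.decomposition import PCA
# PCA needs a 2-D array of shape (n_samples, n_features); the 1-D `data` above would raise an error
X = np.random.normal(0, 1, (1000, 5))  # illustrative multivariate sample
pca = PCA(n_components=0.95)
transformed = pca.fit_transform(X)
# Standardize each component, then flag rows where any component score exceeds 3
outliers = (np.abs(stats.zscore(transformed, axis=0)) > 3).any(axis=1)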
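Another common way to use PCA for outlier detection is reconstruction error: project each row onto the leading components, map it back, and measure how much is lost. This is a swapped-in variant rather than the score threshold used above, and the synthetic two-feature data and the planted row `[2.0, -4.0]` are assumptions for the demo.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
x = rng.normal(0, 1, 300)
# Two strongly correlated features: the second is roughly twice the first
X = np.column_stack([x, 2 * x + rng.normal(0, 0.1, 300)])
# Each coordinate is within about two standard deviations, but the pair breaks the correlation
X = np.vstack([X, [2.0, -4.0]])

pca = PCA(n_components=1).fit(X)
reconstructed = pca.inverse_transform(pca.transform(X))
errors = np.linalg.norm(X - reconstructed, axis=1)  # distance from each row to its reconstruction
worst = int(np.argmax(errors))
print(worst, errors[worst])
```

A per-feature rule such as the z-score would miss this row, because neither coordinate is extreme on its own; the reconstruction error catches it because the row lies far from the direction the rest of the data follows.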
6. Visualization Tools 🖼️
- Use matplotlib for scatter plots
- Apply seaborn for boxplots
- Combine with plotly for interactive dashboards
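As a quick sketch of the visual approach, the snippet below draws a boxplot with matplotlib and reads back the points it draws as fliers. The sample values are made up, and the `Agg` backend is an assumption so the code also runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch also runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Made-up readings with one obviously extreme value
values = np.array([10, 12, 12, 13, 12, 11, 14, 13, 15, 102])

fig, ax = plt.subplots()
artists = ax.boxplot(values)  # whiskers default to 1.5 * IQR, the same rule as the IQR method
ax.set_ylabel("value")

# Points beyond the whiskers are drawn as individual "flier" markers
fliers = artists["fliers"][0].get_ydata()
print(fliers)
```

In an interactive session, drop the `matplotlib.use("Agg")` line and call `plt.show()` to view the figure.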
For advanced techniques like autoencoders or isolation trees, check our guide on Advanced Outlier Detection Methods 🔍