This section provides practical exercises to apply the unsupervised learning techniques discussed in this chapter. You'll use scikit-learn to implement K-Means and DBSCAN for clustering, and Isolation Forest for anomaly detection, on a generated dataset. This hands-on experience will help solidify your understanding of how these algorithms work and how to interpret their results.
Let's start by generating a synthetic dataset. We'll use make_blobs from scikit-learn to create distinct groups of points, and then add some randomly scattered points that can be considered outliers or noise.
import numpy as np
import pandas as pd
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
import plotly.express as px
# Generate sample data
X, y_true = make_blobs(n_samples=400, centers=4,
                       cluster_std=0.80, random_state=42)
# Add some noise points far from the clusters
rng = np.random.RandomState(42)
n_outliers = 30
outliers = rng.uniform(low=np.min(X) - 5, high=np.max(X) + 5, size=(n_outliers, 2))
X = np.vstack([X, outliers])
# Standardize the features for algorithms sensitive to scale (like K-Means and DBSCAN)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Create a DataFrame for easier visualization
df = pd.DataFrame(X_scaled, columns=['Feature 1', 'Feature 2'])
# Initial visualization of the data
fig_initial = px.scatter(df, x='Feature 1', y='Feature 2',
                         title='Synthetic Dataset with Potential Outliers',
                         color_discrete_sequence=['#495057'])  # Use gray for the not-yet-clustered points
fig_initial.update_layout(showlegend=False)
# fig_initial.show() # Display the plot in a Python environment
Initial scatter plot of the generated dataset features after scaling. The distinct groups are visible, along with some scattered points.
K-Means aims to partition the data into k distinct, non-overlapping clusters. Each data point belongs to the cluster with the nearest mean (cluster centroid). We need to specify the number of clusters, k. Based on the visualization (and the way we generated the data), k=4 seems like a reasonable starting point.
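If the right number of clusters were less obvious, a quick inertia sweep (the elbow heuristic) is one way to sanity-check the choice. The sketch below is an optional aside that reuses X_scaled from the previous step; the range of k values tried is an arbitrary choice.
from sklearn.cluster import KMeans
# Optional: fit K-Means for several values of k and record the inertia
# (the sum of squared distances from each point to its nearest centroid).
# A pronounced bend ("elbow") in the curve suggests a reasonable k.
inertias = []
k_values = range(1, 9)
for k in k_values:
    model = KMeans(n_clusters=k, random_state=42, n_init=10)
    model.fit(X_scaled)
    inertias.append(model.inertia_)
fig_elbow = px.line(x=list(k_values), y=inertias, markers=True,
                    labels={'x': 'Number of clusters (k)', 'y': 'Inertia'},
                    title='Elbow Heuristic for Choosing k')
# fig_elbow.show()
If the curve flattens noticeably around k=4, that supports the choice suggested by the scatter plot.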
from sklearn.cluster import KMeans
# Instantiate and fit K-Means
kmeans = KMeans(n_clusters=4, random_state=42, n_init=10)  # n_init set explicitly; recent scikit-learn versions default to n_init='auto'
kmeans.fit(X_scaled)
# Get cluster assignments and centroids
df['KMeans Cluster'] = kmeans.labels_.astype(str) # Convert to string for discrete colors
centroids = scaler.inverse_transform(kmeans.cluster_centers_)  # Centroids expressed in the original (unscaled) feature space, useful for interpretation
# Visualize K-Means results
fig_kmeans = px.scatter(df, x='Feature 1', y='Feature 2',
                        color='KMeans Cluster',
                        title='K-Means Clustering Results (k=4)',
                        color_discrete_sequence=px.colors.qualitative.Pastel)  # Use a nice color sequence
# Add centroids to the plot (kmeans.cluster_centers_ are already in scaled
# coordinates because the model was fit on X_scaled)
fig_kmeans.add_scatter(x=kmeans.cluster_centers_[:, 0], y=kmeans.cluster_centers_[:, 1],
                       mode='markers', marker=dict(color='#d6336c', size=12, symbol='x'),
                       name='Centroids')
# fig_kmeans.show()
K-Means clustering results with k=4. Points are colored by their assigned cluster, and cluster centroids are marked with 'x'. Notice how the outliers get assigned to the nearest cluster.
K-Means successfully identifies the main groups, but it forces every point, including the clear outliers we added, into one of the clusters. This happens because K-Means assumes clusters are spherical and assigns every point to the closest centroid.
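If you need an outlier signal from K-Means itself, one common workaround is to look at each point's distance to its assigned centroid and flag the largest distances. The sketch below reuses the fitted kmeans model and X_scaled; the 7% cutoff is an arbitrary choice made only to mirror the proportion of noise we injected.
# Distance from each point to its nearest (assigned) centroid
distances = kmeans.transform(X_scaled).min(axis=1)
# Flag the ~7% of points farthest from their centroid as candidate outliers
threshold = np.quantile(distances, 0.93)
candidate_outliers = distances > threshold
print(f"Flagged {candidate_outliers.sum()} candidate outliers out of {len(X_scaled)} points")
This works reasonably well here, but it still inherits the spherical-cluster assumption; density-based methods handle irregular shapes and noise more gracefully.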
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) groups together points that are closely packed, marking as outliers points that lie alone in low-density regions. It doesn't require specifying the number of clusters beforehand but relies on two parameters: eps (the maximum distance between two samples for one to be considered as in the neighborhood of the other) and min_samples (the number of samples in a neighborhood for a point to be considered a core point).
Choosing appropriate eps and min_samples values often requires some experimentation or domain knowledge. Let's try some values. A smaller eps or a larger min_samples will result in more points being classified as noise.
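A common heuristic for choosing eps is the k-distance plot: compute each point's distance to its k-th nearest neighbor (with k tied to min_samples), sort those distances, and look for the knee where they start rising sharply. The sketch below reuses X_scaled and assumes min_samples=5.
from sklearn.neighbors import NearestNeighbors
# With n_neighbors=5 the query point itself is included, so the last column is
# the distance to its 4th nearest other point, matching min_samples=5
# (the point plus four neighbors).
neighbors = NearestNeighbors(n_neighbors=5).fit(X_scaled)
distances, _ = neighbors.kneighbors(X_scaled)
k_distances = np.sort(distances[:, -1])
fig_kdist = px.line(x=np.arange(len(k_distances)), y=k_distances,
                    labels={'x': 'Points sorted by distance', 'y': 'Distance to 4th nearest neighbor'},
                    title='k-Distance Plot for Choosing eps')
# fig_kdist.show()
The knee of this curve gives a data-driven starting value for eps, which you can then refine by inspecting the resulting clusters.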
from sklearn.cluster import DBSCAN
# Instantiate and fit DBSCAN
# These parameters might need tuning depending on the dataset density
dbscan = DBSCAN(eps=0.3, min_samples=5)
dbscan.fit(X_scaled)
# Get cluster assignments (-1 indicates noise/outliers)
df['DBSCAN Cluster'] = dbscan.labels_.astype(str) # Convert to string for discrete colors
# Visualize DBSCAN results
fig_dbscan = px.scatter(df, x='Feature 1', y='Feature 2',
                        color='DBSCAN Cluster',
                        title=f'DBSCAN Clustering Results (eps={dbscan.eps}, min_samples={dbscan.min_samples})',
                        color_discrete_map={"-1": "#adb5bd"},  # Gray for noise points
                        category_orders={"DBSCAN Cluster": sorted(df['DBSCAN Cluster'].unique(), key=int)},  # Ensure -1 is listed first
                        color_discrete_sequence=px.colors.qualitative.Pastel)  # Colors for the actual clusters
# fig_dbscan.show()
DBSCAN clustering results. Points labeled '-1' (gray) are identified as noise/outliers because they don't belong to any dense region according to the chosen eps and min_samples.
Compare this to the K-Means plot. DBSCAN successfully identifies the four main clusters and, importantly, flags most of the synthetic outliers (and potentially some points on the fringes of the main clusters) as noise (cluster label -1). This ability to find noise points is a significant advantage of density-based clustering when dealing with datasets containing outliers.
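To quantify this, you can count how many points DBSCAN labeled as noise and how large each cluster is. A short sketch using the fitted dbscan model from above:
# Count points per DBSCAN label (-1 is noise)
labels, counts = np.unique(dbscan.labels_, return_counts=True)
for label, count in zip(labels, counts):
    name = 'Noise' if label == -1 else f'Cluster {label}'
    print(f"{name}: {count} points")
n_clusters = len(set(dbscan.labels_)) - (1 if -1 in dbscan.labels_ else 0)
print(f"Estimated number of clusters: {n_clusters}")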
While DBSCAN inherently identifies noise points, which can be considered anomalies, other algorithms are specifically designed for anomaly detection. Let's try the Isolation Forest algorithm. It works by building trees that randomly partition the data; observations that are isolated quickly, meaning they require fewer random splits to be separated from the rest, are scored as anomalies.
from sklearn.ensemble import IsolationForest
# Instantiate and fit Isolation Forest
# 'contamination' is the expected proportion of outliers, set to 'auto' or a specific value
# Let's estimate based on the number of noise points we added (30 / 430) ~ 0.07
iso_forest = IsolationForest(contamination=0.07, random_state=42)
iso_forest.fit(X_scaled)
# Predict anomalies (-1 for anomalies, 1 for inliers)
df['Anomaly'] = iso_forest.predict(X_scaled)
df['Anomaly'] = df['Anomaly'].map({1: 'Inlier', -1: 'Anomaly'}) # Map to readable labels
# Visualize Anomaly Detection results
fig_anomaly = px.scatter(df, x='Feature 1', y='Feature 2',
                         color='Anomaly',
                         title='Anomaly Detection using Isolation Forest',
                         color_discrete_map={'Inlier': '#1f77b4', 'Anomaly': '#d62728'},  # Blue inliers, distinct red anomalies
                         category_orders={"Anomaly": ["Inlier", "Anomaly"]})  # Ensure consistent legend order
# fig_anomaly.show()
Isolation Forest results, highlighting points identified as anomalies (red).
The Isolation Forest identifies many of the same points as DBSCAN's noise points. However, the exact set might differ based on the algorithm's logic and parameters (like the contamination factor). Isolation Forest is specifically designed to find outliers, while DBSCAN finds them as a byproduct of identifying dense regions. Depending on the specific goal, one might be preferred over the other.
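One quick way to compare the two methods is to cross-tabulate DBSCAN's noise flag against the Isolation Forest label. A small sketch using the df built above (the 'DBSCAN Noise' column is a helper added here just for the comparison):
# Cross-tabulate DBSCAN noise vs. Isolation Forest anomalies
df['DBSCAN Noise'] = np.where(df['DBSCAN Cluster'] == '-1', 'Noise', 'Clustered')
print(pd.crosstab(df['DBSCAN Noise'], df['Anomaly']))
Points flagged by both methods are strong outlier candidates, while disagreements are worth inspecting individually.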
This practice session demonstrated how to apply K-Means and DBSCAN for clustering and Isolation Forest for anomaly detection. You saw how K-Means assigns all points to clusters, while DBSCAN can identify noise. Isolation Forest provides a targeted approach for finding outliers. Experimenting with parameters (k for K-Means, eps and min_samples for DBSCAN, contamination for Isolation Forest) is often necessary to achieve the desired results for a specific dataset and analysis goal.
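As a starting point for that experimentation, the sketch below sweeps a few eps values and reports how many clusters and noise points each produces; the specific range of values is an arbitrary choice for this dataset.
# Sweep a few eps values and see how the cluster count and noise count change
for eps in [0.2, 0.3, 0.4, 0.5]:
    labels = DBSCAN(eps=eps, min_samples=5).fit_predict(X_scaled)
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    n_noise = int(np.sum(labels == -1))
    print(f"eps={eps}: {n_clusters} clusters, {n_noise} noise points")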