Week 14 Guide
Chapter 3: Clustering
Week 10 introduced scaling, which prepares features so that no single range dominates calculations. Week 12 introduced dimensionality reduction, which finds compact representations of high-dimensional data by removing redundant information. Scaling and dimensionality reduction both prepare data before doing something else with it. Week 14 introduces clustering, which does not prepare data for another step but instead finds structure in it directly.
Clustering is the task of finding groups in data without labels. Unlike supervised learning, clustering uses only features: no target column, no known correct groupings, and no way to score results against an external reference. This week’s reading and demo introduce three algorithms that approach the clustering task in different ways: k-Means, which places centers in the data and assigns each point to the nearest one; agglomerative clustering, which starts with every point as its own cluster and merges upward; and DBSCAN, which identifies dense regions and leaves isolated points unlabeled. The demo ends with a close look at how clustering results are evaluated and where the silhouette score, the most commonly used evaluation metric, produces misleading results.
The demo uses synthetic data generated by `make_blobs` and `make_moons` from `sklearn.datasets`. Synthetic data makes the concepts concrete: because the correct groupings are known in advance, students can see directly whether each algorithm found the right structure.
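The two generators can be sketched as follows (the sample counts, noise level, and random seeds here are illustrative assumptions, not necessarily the demo's exact settings):

```python
from sklearn.datasets import make_blobs, make_moons

# Three well-separated Gaussian blobs: the "easy" case for k-Means.
# y_blobs holds the true group of each point, known because the data is synthetic.
X_blobs, y_blobs = make_blobs(n_samples=300, centers=3, random_state=42)

# Two interleaving half-circles: a shape k-Means cannot recover.
X_moons, y_moons = make_moons(n_samples=300, noise=0.05, random_state=42)

print(X_blobs.shape, y_blobs.shape)  # (300, 2) (300,)
print(set(y_moons))                  # {0, 1}
```

Because the true labels come back alongside the features, each algorithm's output can be checked against a known answer, which is exactly what real-world clustering data does not allow.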
Week 14 Assignment:
- Here is the link to the Week 14 assignments page.
Demo and textbook coverage
In the demo you will:
- Apply `KMeans` from `sklearn.cluster` using the `fit` workflow and examine the `labels_` and `cluster_centers_` attributes
- Use `predict` to assign new data points to clusters and interpret what the cluster numbers mean
- Observe what happens when `n_clusters` does not match the actual structure of the data
- See where k-Means fails: why it cannot find clusters with complex shapes, and what the failure reveals about how k-Means defines a cluster
- Apply `AgglomerativeClustering` from `sklearn.cluster` using `fit_predict`, examine the hierarchical tree structure and ward linkage, and see why there is no `predict` method for new points
- Apply `DBSCAN` from `sklearn.cluster`, set `eps` and `min_samples`, and observe how core, boundary, and noise points are classified
- Scale data before DBSCAN and see why `eps` requires consistent feature ranges to be meaningful
- Evaluate clustering results using `silhouette_score` from `sklearn.metrics` and examine a case where the score ranks a visually wrong clustering higher than a visually correct one
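The steps above can be condensed into one sketch (parameter values such as `n_clusters=3` and `eps=0.5` are illustrative assumptions, not the demo's exact settings):

```python
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# k-Means: fit, then inspect labels_ and cluster_centers_; predict works for
# new points because the fitted centers are stored on the model.
km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
print(km.labels_[:10])            # cluster index of the first ten points
print(km.cluster_centers_.shape)  # (3, 2): one center per cluster
print(km.predict([[0.0, 0.0]]))   # assign a new point to its nearest center

# Agglomerative clustering: only fit_predict -- there is no predict for new
# points, because the merge hierarchy is defined over the training set alone.
agg_labels = AgglomerativeClustering(n_clusters=3, linkage="ward").fit_predict(X)

# DBSCAN: scale first so eps denotes the same distance in every feature.
X_scaled = StandardScaler().fit_transform(X)
db_labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X_scaled)
print(set(db_labels))  # -1, if present, marks noise points

# Silhouette score: higher means more compact, not necessarily more correct.
print(silhouette_score(X, km.labels_))
```

Note the asymmetry the demo highlights: `KMeans` supports `predict`, while `AgglomerativeClustering` and `DBSCAN` only support `fit_predict` on the data they were fit to.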
In the textbook you will read about:
- k-Means failure cases in depth: clusters with different densities, elongated clusters, and the two_moons dataset
- Vector quantization: k-Means as a decomposition method where each point is represented by its cluster center
- Dendrograms: visualizing the full hierarchy of agglomerative clustering merges using SciPy
- DBSCAN applied to a face image dataset for outlier detection
- Adjusted Rand index (ARI): an evaluation metric that requires known correct labels, and how it compares to the silhouette score
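As a minimal sketch of what makes ARI different from the silhouette score: it compares a predicted grouping against known true labels, and it ignores the arbitrary numbering of clusters. The labelings below are made up for illustration:

```python
from sklearn.metrics import adjusted_rand_score

true_labels = [0, 0, 0, 1, 1, 1]

# A perfect grouping scores 1.0 even though the cluster numbers are swapped:
# ARI measures agreement of the grouping, not the label values themselves.
print(adjusted_rand_score(true_labels, [1, 1, 1, 0, 0, 0]))  # 1.0

# A grouping unrelated to the true one scores near 0 (possibly slightly negative).
print(adjusted_rand_score(true_labels, [0, 1, 0, 1, 0, 1]))
```

Because ARI needs `true_labels`, it works in the textbook's synthetic examples but not in most real applications, where no correct grouping exists to compare against.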
Reading expectations
After completing the demo and reading, you should be able to explain the following in your own words:
- How does clustering differ from classification? What does the absence of a target column change about what the algorithm can do and how results are evaluated?
- k-Means assigns every point to a cluster and supports `predict` for new data. Why does agglomerative clustering have neither of those properties?
- What do `eps` and `min_samples` each control in DBSCAN, and what happens to the clustering result when `eps` is set too small or too large?
- Why must data be scaled before applying DBSCAN, but not before applying k-Means or agglomerative clustering?
- The silhouette score gave k-Means a higher score than DBSCAN on the two_moons dataset, even though k-Means produced the wrong clustering. What does the silhouette score actually measure, and why does measuring compactness favor k-Means regardless of whether k-Means found the correct structure?
- What is the adjusted rand index, and why can it be used to compare clustering algorithms in the textbook’s examples but not in most real-world applications?
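The silhouette comparison described above can be reproduced in a few lines. This is a sketch, not the demo's exact code: `eps` is left at scikit-learn's default, and the sample count, noise level, and seeds are assumptions.

```python
from sklearn.cluster import KMeans, DBSCAN
from sklearn.datasets import make_moons
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score

X, _ = make_moons(n_samples=200, noise=0.05, random_state=0)
X = StandardScaler().fit_transform(X)

# k-Means splits the moons with a straight boundary: wrong shape, compact clusters.
km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# DBSCAN traces the two half-moons: correct shape, but elongated clusters.
db_labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)

# Silhouette rewards compactness, so the visually wrong k-Means clustering
# can outscore the visually correct DBSCAN one.
print("k-Means silhouette:", silhouette_score(X, km_labels))
print("DBSCAN silhouette: ", silhouette_score(X, db_labels))
```

The point to take away is not the particular numbers but the ranking: a metric that measures only within-cluster compactness and between-cluster separation has no way to know which grouping matches the data's actual structure.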
Week 14 tasks
- Read Chapter 3, clustering section (pages 166–207).
- Work through the Week 14 demo in your Jupyter environment.
- Complete the Week 14 D2L quiz.