Week 14 Guide
Chapter 3: Clustering
Week 10 introduced scaling, which prepares features so that no single range dominates calculations. Week 12 introduced dimensionality reduction, which finds compact representations of high-dimensional data by removing redundant information. Scaling and dimensionality reduction both prepare data before doing something else with it. Week 14 introduces clustering, which does not prepare data for another step but instead finds structure in it directly.
Clustering is the task of finding groups in data without labels. Unlike supervised learning, clustering uses only features: no target column, no known correct groupings, and no way to score results against an external reference. This week’s reading and demo introduce three algorithms that approach the clustering task in different ways: k-Means, which places centers in the data and assigns each point to the nearest one; agglomerative clustering, which starts with every point as its own cluster and merges upward; and DBSCAN, which identifies dense regions and leaves isolated points unlabeled. The demo ends with a close look at how clustering results are evaluated and where the silhouette score, the most commonly used evaluation metric, produces misleading results.
The demo uses synthetic data generated by `make_blobs` and `make_moons` from `sklearn.datasets`. Synthetic data makes the concepts concrete: because the correct groupings are known in advance, students can see directly whether each algorithm found the right structure.
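The two generators can be sketched as follows (the sample counts, noise level, and random seeds here are illustrative assumptions, not necessarily the demo's exact settings):

```python
from sklearn.datasets import make_blobs, make_moons

# Three well-separated Gaussian blobs: the "easy" case for k-Means.
# y_blobs holds the true group of each point, known because the data is synthetic.
X_blobs, y_blobs = make_blobs(n_samples=300, centers=3, random_state=42)

# Two interleaving half-circles: a shape k-Means cannot recover.
X_moons, y_moons = make_moons(n_samples=300, noise=0.05, random_state=42)

print(X_blobs.shape, y_blobs.shape)  # (300, 2) (300,)
print(set(y_moons))                  # {0, 1}
```

Because the true labels come back alongside the features, each algorithm's output can be checked against a known answer, which is exactly what real-world clustering data does not allow.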
Week 14 Assignment:
- Here is the link to the Week 14 assignments page.
Demo and textbook coverage
In the demo you will:
- Apply `KMeans` from `sklearn.cluster` using the `fit` workflow and examine the `labels_` and `cluster_centers_` attributes
- Use `predict` to assign new data points to clusters and interpret what the cluster numbers mean
- Observe what happens when `n_clusters` does not match the actual structure of the data
- See where k-Means fails: why it cannot find clusters with complex shapes, and what the failure reveals about how k-Means defines a cluster
- Apply `AgglomerativeClustering` from `sklearn.cluster` using `fit_predict`, examine the hierarchical tree structure and ward linkage, and see why there is no `predict` method for new points
- Apply `DBSCAN` from `sklearn.cluster`, set `eps` and `min_samples`, and observe how core, boundary, and noise points are classified
- Scale data before DBSCAN and see why `eps` requires consistent feature ranges to be meaningful
- Evaluate clustering results using `silhouette_score` from `sklearn.metrics` and examine a case where the score ranks a visually wrong clustering higher than a visually correct one
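The steps above can be condensed into one sketch (parameter values such as `n_clusters=3` and `eps=0.5` are illustrative assumptions, not the demo's exact settings):

```python
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# k-Means: fit, then inspect labels_ and cluster_centers_; predict works for
# new points because the fitted centers are stored on the model.
km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
print(km.labels_[:10])            # cluster index of the first ten points
print(km.cluster_centers_.shape)  # (3, 2): one center per cluster
print(km.predict([[0.0, 0.0]]))   # assign a new point to its nearest center

# Agglomerative clustering: only fit_predict -- there is no predict for new
# points, because the merge hierarchy is defined over the training set alone.
agg_labels = AgglomerativeClustering(n_clusters=3, linkage="ward").fit_predict(X)

# DBSCAN: scale first so eps denotes the same distance in every feature.
X_scaled = StandardScaler().fit_transform(X)
db_labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X_scaled)
print(set(db_labels))  # -1, if present, marks noise points

# Silhouette score: higher means more compact, not necessarily more correct.
print(silhouette_score(X, km.labels_))
```

Note the asymmetry the demo highlights: `KMeans` supports `predict`, while `AgglomerativeClustering` and `DBSCAN` only support `fit_predict` on the data they were fit to.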
In the textbook you will read about:
- k-Means failure cases in depth: clusters with different densities, elongated clusters, and the two_moons dataset
- Vector quantization: k-Means as a decomposition method where each point is represented by its cluster center
- Dendrograms: visualizing the full hierarchy of agglomerative clustering merges using SciPy
- DBSCAN applied to a face image dataset for outlier detection
- Adjusted Rand index (ARI): an evaluation metric that requires known correct labels, and how it compares to the silhouette score
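As a minimal sketch of what makes ARI different from the silhouette score: it compares a predicted grouping against known true labels, and it ignores the arbitrary numbering of clusters. The labelings below are made up for illustration:

```python
from sklearn.metrics import adjusted_rand_score

true_labels = [0, 0, 0, 1, 1, 1]

# A perfect grouping scores 1.0 even though the cluster numbers are swapped:
# ARI measures agreement of the grouping, not the label values themselves.
print(adjusted_rand_score(true_labels, [1, 1, 1, 0, 0, 0]))  # 1.0

# A grouping unrelated to the true one scores near 0 (possibly slightly negative).
print(adjusted_rand_score(true_labels, [0, 1, 0, 1, 0, 1]))
```

Because ARI needs `true_labels`, it works in the textbook's synthetic examples but not in most real applications, where no correct grouping exists to compare against.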
Reading expectations
After completing the demo and reading, you should be able to explain the following in your own words:
- How does clustering differ from classification? What does the absence of a target column change about what the algorithm can do and how results are evaluated?
- k-Means assigns every point to a cluster and supports `predict` for new data. Why does agglomerative clustering have neither of those properties?
- What do `eps` and `min_samples` each control in DBSCAN, and what happens to the clustering result when `eps` is set too small or too large?
- Why must data be scaled before applying DBSCAN, but not before applying k-Means or agglomerative clustering?
- The silhouette score gave k-Means a higher score than DBSCAN on the two_moons dataset, even though k-Means produced the wrong clustering. What does the silhouette score actually measure, and why does measuring compactness favor k-Means regardless of whether k-Means found the correct structure?
- What is the adjusted rand index, and why can it be used to compare clustering algorithms in the textbook’s examples but not in most real-world applications?
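The silhouette comparison described above can be reproduced in a few lines. This is a sketch, not the demo's exact code: `eps` is left at scikit-learn's default, and the sample count, noise level, and seeds are assumptions.

```python
from sklearn.cluster import KMeans, DBSCAN
from sklearn.datasets import make_moons
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score

X, _ = make_moons(n_samples=200, noise=0.05, random_state=0)
X = StandardScaler().fit_transform(X)

# k-Means splits the moons with a straight boundary: wrong shape, compact clusters.
km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# DBSCAN traces the two half-moons: correct shape, but elongated clusters.
db_labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)

# Silhouette rewards compactness, so the visually wrong k-Means clustering
# can outscore the visually correct DBSCAN one.
print("k-Means silhouette:", silhouette_score(X, km_labels))
print("DBSCAN silhouette: ", silhouette_score(X, db_labels))
```

The point to take away is not the particular numbers but the ranking: a metric that measures only within-cluster compactness and between-cluster separation has no way to know which grouping matches the data's actual structure.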
Week 14 tasks
- Read Chapter 3, clustering section (pages 166–207).
- Work through the Week 14 demo in your Jupyter environment.
- Complete the Week 14 D2L quiz.