Week 13 Assignment (CMSC-2208)
Dimensionality Reduction Application and Video Reflection
Submission location: All items are submitted in D2L (Week 13 dropbox).
The Scenario
A game studio is building a personalization system for its flagship multiplayer title. The studio has identified four player archetypes and wants to use those archetypes to deliver personalized in-game content recommendations:
- Explorer
- Competitor
- Socializer
- Achiever
A prior survey-based study established archetype labels for a subset of the player base, giving the team a labeled dataset to work with.
A data scientist has been brought in to build the classifier. They have assembled 90 days of behavioral telemetry on 2,000 labeled players and plan to use kNN to predict which archetype each player belongs to. Before training, they reviewed the feature table and noticed that the features span dramatically different numeric ranges. They applied StandardScaler to address that and are now asking whether the dataset is ready for training.
There is one more thing the data scientist has flagged. Looking at the features, several of them appear to track overlapping aspects of player behavior. Players who spend more time in the game tend to complete more missions. Players who join more party sessions tend to send more chat messages. The data scientist suspects this overlap is worth addressing before training but is not sure how.
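One way to quantify the overlap described above is a Pearson correlation coefficient for each feature pair. The sketch below is a minimal illustration that uses only the six sample rows listed later in this handout. The `pearson` helper is a hypothetical name introduced here, and six rows are far too few for a reliable estimate, so a real check would run on the full training set.

```python
import math

# Columns copied from the six sample rows shown later in this handout.
hours_per_week = [42, 11, 38, 29, 6, 19]
missions_completed = [1203, 89, 614, 743, 44, 381]
party_sessions = [38, 389, 12, 67, 4, 144]
chat_messages_sent = [142, 7841, 67, 203, 31, 1204]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    covariance = sum(a * b for a, b in zip(dx, dy))
    spread = math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    return covariance / spread


# The two pairs named in the scenario text (illustrative only).
r_time = pearson(hours_per_week, missions_completed)
r_social = pearson(party_sessions, chat_messages_sent)
print(f"hours_per_week vs missions_completed: r = {r_time:.2f}")
print(f"party_sessions vs chat_messages_sent: r = {r_social:.2f}")
```

A coefficient close to +1 indicates that the two features tend to rise and fall together across players, which is the kind of overlap the data scientist has flagged.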
Features recorded for each player:
| Feature | Description | Range |
|---|---|---|
| hours_per_week | Average hours played per week | 1–62 |
| missions_completed | Total missions completed in 90 days | 8–1,847 |
| deaths_per_session | Average deaths per play session | 0.2–28.4 |
| items_purchased | Total in-game items purchased | 0–634 |
| chat_messages_sent | Total chat messages sent | 0–9,203 |
| party_sessions | Multiplayer party sessions joined | 0–412 |
| map_regions_visited | Unique map regions visited | 3–287 |
| achievements_unlocked | Total achievements unlocked | 1–318 |
| distance_traveled_km | Total in-game distance traveled | 0.4–18,293 |
| difficulty_rating | Average self-selected difficulty (1–5 scale) | 1.0–5.0 |
| daily_login_streak | Longest login streak in days | 1–90 |
Here is a sample of the data:
| hours_per_week | missions_completed | deaths_per_session | items_purchased | chat_messages_sent | party_sessions | map_regions_visited | achievements_unlocked | distance_traveled_km | difficulty_rating | daily_login_streak |
|---|---|---|---|---|---|---|---|---|---|---|
| 42 | 1,203 | 3.1 | 287 | 142 | 38 | 94 | 241 | 8,847 | 3.5 | 74 |
| 11 | 89 | 1.4 | 23 | 7,841 | 389 | 19 | 22 | 412 | 2.0 | 31 |
| 38 | 614 | 5.8 | 91 | 67 | 12 | 261 | 88 | 17,204 | 2.5 | 58 |
| 29 | 743 | 22.4 | 412 | 203 | 67 | 41 | 134 | 3,102 | 4.8 | 45 |
| 6 | 44 | 0.9 | 8 | 31 | 4 | 11 | 9 | 183 | 1.5 | 12 |
| 19 | 381 | 4.2 | 156 | 1,204 | 144 | 72 | 67 | 2,891 | 3.0 | 43 |
Notice that distance_traveled_km reaches into the tens of thousands while difficulty_rating tops out at 5.0. chat_messages_sent exceeds 7,800 in one record while deaths_per_session never exceeds 28.4. All of these are valid behavioral measurements, but their numeric scales differ by orders of magnitude.
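To make the scale gap concrete, the sketch below breaks the raw Euclidean distance between two of the sample rows into per-feature contributions. It is only an illustration on unscaled values: the variable names (`FEATURES`, `row_1`, `row_3`) are introduced here, and the choice of the first and third sample rows is arbitrary.

```python
import math

FEATURES = [
    "hours_per_week", "missions_completed", "deaths_per_session",
    "items_purchased", "chat_messages_sent", "party_sessions",
    "map_regions_visited", "achievements_unlocked",
    "distance_traveled_km", "difficulty_rating", "daily_login_streak",
]

# First and third rows of the sample table above, unscaled.
row_1 = [42, 1203, 3.1, 287, 142, 38, 94, 241, 8847, 3.5, 74]
row_3 = [38, 614, 5.8, 91, 67, 12, 261, 88, 17204, 2.5, 58]

# Squared difference per feature: the terms that Euclidean distance adds up.
squared_diffs = {
    name: (a - b) ** 2 for name, a, b in zip(FEATURES, row_1, row_3)
}
total = sum(squared_diffs.values())
raw_distance = math.sqrt(total)

# Share of the squared distance contributed by each feature, largest first.
shares = sorted(
    ((name, value / total) for name, value in squared_diffs.items()),
    key=lambda item: item[1],
    reverse=True,
)

print(f"raw Euclidean distance: {raw_distance:.1f}")
for name, share in shares[:3]:
    print(f"{name:<24s} {share:6.1%}")
```

For this pair of rows, distance_traveled_km accounts for more than 99% of the squared distance, so on raw values the other ten features would barely influence which neighbors kNN selects.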
Your Task
Record a video in which you advise the data scientist throughout the project. Address each of the four sections below in order. You are acting as a consultant who has reviewed the data, the plan, and the results. Explain your reasoning clearly, use correct terminology, and connect your answers to the specific features and scenario described.
Do not read from a script. Focus on demonstrating that you understand the concepts and can apply them to this situation.
Requirements
- Clear video and audio quality
- Intro (required): Start your video by saying: “Hello, my name is [Your Name]. This is the Week 13 Dimensionality Reduction assignment for CMSC 2208.”
Section A: Is Scaling Enough?
The data scientist has reviewed the feature table and noticed the scale differences. They send you the following message:
“I’ve applied StandardScaler to all 11 features. The data is scaled and I’m ready to train kNN. Is there anything else I need to do before training?”
- Look at the feature table and identify at least two pairs of features you would expect to be correlated. For each pair, explain the behavioral logic. Why would those two measurements tend to move together across players?
- The data scientist says they are ready to train. Are they? Identify what they are missing and explain why scaling alone does not fully prepare this dataset for kNN. Use the correlation pairs you identified to support your answer, and be specific about what problem scaling solved and what different problem remains.
Section B: A Flaw in the Plan
After your conversation, the data scientist sends you their revised preprocessing plan:
“Here is what I am going to do. First I will split the data into training and test sets. Then I will fit StandardScaler on the training data and transform both sets. Then, since more data should give me better components, I will fit PCA on the combined scaled training and test data before transforming each set separately.”
- There is a mistake in this plan. Your answer should cover all four of the following: identify the mistake precisely, explain what goes wrong as a result, describe what the correct sequence looks like, and address what PCA learns during the fit step and why that makes fitting on combined data a problem.
- The data scientist asks why they cannot skip StandardScaler and just fit PCA directly on the raw features to save a step. Look at the feature ranges in the table and name the specific features that make this shortcut most damaging. Explain what happens to the principal components as a result.
Section C: How Many Components?
The data scientist followed your recommendations and fit PCA on the scaled training data. They share the following explained variance table and ask for your advice:
| Component | Individual | Cumulative |
|---|---|---|
| PC1 | 29.3% | 29.3% |
| PC2 | 18.7% | 48.0% |
| PC3 | 12.4% | 60.4% |
| PC4 | 9.1% | 69.5% |
| PC5 | 7.6% | 77.1% |
| PC6 | 5.8% | 82.9% |
| PC7 | 4.3% | 87.2% |
| PC8 | 3.9% | 91.1% |
| PC9 | 3.4% | 94.5% |
| PC10 | 3.1% | 97.6% |
| PC11 | 2.4% | 100.0% |
- The data scientist needs to choose a value for n_components before training kNN. What do you recommend and why? State the tradeoff you are accepting and connect your reasoning to what the table shows.
- The data scientist’s manager reviews the plan and says: “Just keep all 11 components so you do not lose any information.” How do you respond? What does the manager misunderstand about what the later components contain?
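The Cumulative column in the explained variance table is a running sum of the Individual column. The sketch below reproduces that arithmetic and shows how a variance threshold maps to a component count. The `components_needed` helper and the example thresholds are assumptions made for illustration, not a recommended cutoff.

```python
from itertools import accumulate

# "Individual" column of the explained variance table, in percent (PC1..PC11).
individual = [29.3, 18.7, 12.4, 9.1, 7.6, 5.8, 4.3, 3.9, 3.4, 3.1, 2.4]

# "Cumulative" column as printed in the table, used here as a check.
published_cumulative = [
    29.3, 48.0, 60.4, 69.5, 77.1, 82.9, 87.2, 91.1, 94.5, 97.6, 100.0,
]

# Running sum, rounded to one decimal to absorb floating-point noise.
cumulative = [round(value, 1) for value in accumulate(individual)]


def components_needed(threshold_pct):
    """Smallest number of leading components whose cumulative share
    of variance reaches threshold_pct (hypothetical helper)."""
    for k, running_total in enumerate(cumulative, start=1):
        if running_total >= threshold_pct:
            return k
    raise ValueError("threshold exceeds the total explained variance")


# Example thresholds only; the appropriate cutoff is a judgment call.
for threshold in (80, 90, 95):
    print(f"{threshold}% of variance -> {components_needed(threshold)} components")
```

Changing the threshold changes the answer, which is why the choice of n_components has to be justified by a stated tradeoff and cannot simply be read off the table.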
Section D: Results and Next Steps
The data scientist trains two kNN models and shares the accuracy of each:
- kNN on all 11 scaled features: 0.847
- kNN on 5 PCA components (77.1% variance retained): 0.891
The studio also has two additional requests. The creative director wants a visualization showing how the four player archetypes cluster before the model is deployed. The data scientist wants to understand what the PCA components actually represent in terms of the original player behaviors. The second and third questions below address these requests in turn.
- The data scientist is surprised by the accuracy results. They expected that using more features would give kNN more information to work with. Explain why the PCA-reduced model outperformed the full-feature model. Be specific about what the discarded components contain and what effect that has on how kNN computes distances.
- For the creative director’s request, the data scientist plans to reduce to 2 PCA components and plot the archetypes in a scatter plot. A colleague suggests there is a method better suited to this specific task. Advise the data scientist: what is the better method and what does it offer for visualization? Then explain why that same method cannot be used as the preprocessing step in the production classifier. Specifically, what happens when a new player joins the game after the model has been deployed?
- For the data scientist’s interpretability request, what would you recommend they use to connect the PCA components back to the original 11 player behaviors? Explain what it does and what the result reveals.
D2L Submission Checklist
Submit the following items to the Week 13 D2L dropbox.
Video submission (link only)
- Video filename: lastname_Week13DimReduction
- Upload your video to Kaltura
- Do not embed the video in the Dropbox submission
- Paste the share link into the D2L text submission box
- Make sure the link name includes lastname_Week13DimReduction