Playtest Design

Purpose: Get useful signal from playtests. Most playtest sessions are wasted — observers confirm what they already believe, ask leading questions, and draw conclusions from noise. This skill provides structured methods to avoid those traps.

Influences: Frameworks here draw on cognitive UX research methodology, metrics-driven iterative design practice, and experience engineering theory (emergent behavior observation, planning under uncertainty).

When to Activate

Use this skill when:

Planning a playtest session (what to test, who to recruit, what to measure)
Designing post-playtest surveys or interview questions
Setting up analytics/metrics for ongoing data collection
Interpreting playtest results and deciding what to change
Resolving team disagreements about what the data shows

Core Principle: Observe, Then Ask

Players are reliable reporters of their experience (what they felt) but unreliable reporters of causes (why they felt it). Design your process accordingly.

Most Reliable ←———————————————→ Least Reliable
  What they did    What they felt    Why they think
  (behavior)       (experience)      they felt it
                                     (attribution)

Hierarchy of evidence:

Behavioral data — what players actually did (metrics, video, observation)
Experience reports — what players say they felt ("I was frustrated," "that was exciting")
Causal attribution — what players think caused their experience ("the controls are bad")

Players attributing frustration to "bad controls" might actually be experiencing a perception failure (they couldn't see the indicator) or a pacing problem (too many new concepts at once). Use behavior to diagnose; use self-report to locate.

Question Generation Framework

The Three-Pillar Method

Generate questions along the perception → attention → memory pipeline:

Perception Questions (Did they see it?)

Did the player notice [critical UI element / feedback / environmental cue]?
How long before they noticed?
Did they look at it before acting or after?
Did they confuse it with something else?

Attention Questions (Did they focus on the right thing?)

Where was the player looking during [critical moment]?
Did they engage with [intended system] or get distracted by [ancillary system]?
Did they understand what was important vs. optional?
Was there a moment where they seemed overwhelmed?

Memory Questions (Will they retain it?)

After a break, can they recall how to [key mechanic]?
Did they apply a lesson learned earlier to a later challenge?
Did they remember the goal after a distraction?
Can they explain the core rules to someone else?

Stage-Specific Questions

| Dev Stage | Focus | Key Questions |
|-----------|-------|---------------|
| Prototype | Core loop viability | Is the core action inherently interesting? Do they want to do it again? |
| Alpha | System comprehension | Do they understand the rules? Can they make intentional decisions? |
| Beta | Pacing and polish | Does the session arc feel right? Where do they get bored or frustrated? |
| Pre-launch | Edge cases and balance | What breaks? What's exploitable? What did we miss? |

Observation Protocol

What to Watch (Not Ask)

| Observable | What It Tells You |
|------------|-------------------|
| First action | What the UI communicates as "start here" |
| Hesitation points | Where clarity fails or cognitive load spikes |
| Repeated failures | Where difficulty exceeds skill (or UI is misleading) |
| Where they look | What's grabbing attention (intended or not) |
| Body language | Leaning in = engaged; leaning back = disengaged; fidgeting = frustrated |
| Utterances | Unprompted comments ("what?", "oh!", "come on") are gold |
| Where they quit | The most valuable data point you'll collect |
| What they skip | Content they ignore reveals priority mismatches |

The Silent Observer Protocol

Say nothing unless they're about to break the test setup
Don't explain — if they're confused, that's data
Don't reassure — "you're doing great" biases the session
Note timestamps — when you feel the urge to help, write down the time and why
Record everything — your memory of the session will be biased toward your expectations

Metrics to Track

Core Metrics (Track Always)

| Metric | What It Measures | Warning Signal |
|--------|-----------------|----------------|
| Session length | Engagement | Bimodal distribution (some quit fast, some stay long) |
| Quit points | Pain points | Cluster of quits at same location/moment |
| Completion rate | Difficulty/clarity | < 70% on intended-critical-path content |
| Time per section | Pacing | Sections taking 2x+ longer than designed |
| Death/failure rate | Difficulty curve | Spike = wall; zero = too easy |
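Three of the warning signals above (completion rate, quit clusters, sections running 2x over design) can be computed from very little data. The following is a minimal sketch, not a prescribed pipeline: the `Session` record shape, the function names, and the 50% share used to call something a "cluster" are all assumptions made for illustration.

```python
from collections import Counter
from dataclasses import dataclass
from statistics import median
from typing import Optional


@dataclass
class Session:
    """One playtest session (hypothetical record shape)."""
    player_id: str
    completed: bool                    # finished the intended critical path?
    quit_point: Optional[str]          # section where they stopped, None if completed
    section_seconds: dict[str, float]  # actual time spent in each section


def completion_rate(sessions: list[Session]) -> float:
    """Fraction of sessions that finished the critical path."""
    return sum(s.completed for s in sessions) / len(sessions)


def quit_clusters(sessions: list[Session], min_share: float = 0.5) -> dict[str, int]:
    """Sections that account for at least min_share of all quits (assumed cutoff)."""
    quits = Counter(s.quit_point for s in sessions if s.quit_point is not None)
    total = sum(quits.values())
    if total == 0:
        return {}
    return {section: n for section, n in quits.items() if n / total >= min_share}


def slow_sections(sessions: list[Session],
                  designed_seconds: dict[str, float],
                  factor: float = 2.0) -> dict[str, float]:
    """Sections whose median actual time is at least factor x the designed time."""
    slow = {}
    for section, target in designed_seconds.items():
        times = [s.section_seconds[section]
                 for s in sessions if section in s.section_seconds]
        if times and median(times) >= factor * target:
            slow[section] = median(times)
    return slow
```

The median is used for section time so that one player who wandered off does not trip the pacing warning on their own.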

Balance Metrics (When Tuning Systems)

| Metric | What It Measures | Warning Signal |
|--------|-----------------|----------------|
| Pick rate by option | Strategy diversity | One option > 50% pick rate |
| Win rate by strategy | Balance | Any strategy > 55% win rate at comparable skill |
| Average game/match length | Pacing | Games consistently shorter or longer than intended |
| Resource accumulation rate | Economy health | Exponential growth = inflation incoming |
| Strategy churn | Meta health | If dominant strategy shifts too fast, balance is noisy |
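A raw win rate above 55% means little on a handful of matches. One way to guard against that is to check whether the lower bound of a confidence interval clears the threshold. The sketch below uses the Wilson score interval, a standard statistical technique swapped in here for illustration rather than something this skill prescribes; the function names and the 95% confidence level are assumptions.

```python
from math import sqrt
from statistics import NormalDist


def wilson_interval(wins: int, games: int, confidence: float = 0.95) -> tuple[float, float]:
    """Wilson score interval for a win rate observed as wins out of games."""
    if games <= 0:
        raise ValueError("games must be positive")
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    p = wins / games
    denom = 1 + z * z / games
    centre = (p + z * z / (2 * games)) / denom
    half_width = z * sqrt(p * (1 - p) / games + z * z / (4 * games * games)) / denom
    return centre - half_width, centre + half_width


def exceeds_threshold(wins: int, games: int, threshold: float = 0.55) -> bool:
    """True only when even the low end of the interval is above the threshold."""
    low, _high = wilson_interval(wins, games)
    return low > threshold
```

With these definitions, 18 wins in 30 games (60%) does not trip the warning, while 600 wins in 1000 games (also 60%) does. The same check can be reused for the 50% pick-rate signal by passing picks and total picks.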

UX Metrics (When Testing Comprehension)

| Metric | What It Measures | Warning Signal |
|--------|-----------------|----------------|
| Time to first meaningful action | Onboarding quality | > 60 seconds before the player does something |
| Tutorial completion rate | Tutorial design | < 90% = tutorial is the problem, not the player |
| Hint/help usage | Clarity | High usage = UI isn't communicating; zero usage = help system is invisible |
| Error rate on intended actions | Usability | Player tries to do the right thing but fails due to UI |
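Time to first meaningful action is easy to derive from an event log once "meaningful" is defined for your game. A minimal sketch follows, assuming each session is a list of `(seconds_since_start, event_name)` pairs. The event names and the choice to count "never acted" as a slow start are illustrative assumptions.

```python
from typing import Iterable, Optional

SLOW_START_SECONDS = 60.0  # warning threshold taken from the table above

Event = tuple[float, str]  # (seconds since session start, event name)


def time_to_first_action(events: Iterable[Event], meaningful: set[str]) -> Optional[float]:
    """Earliest timestamp of any meaningful event, or None if there was none."""
    times = [t for t, name in events if name in meaningful]
    return min(times) if times else None


def slow_start_share(sessions: list[list[Event]], meaningful: set[str]) -> float:
    """Share of sessions whose first meaningful action came late or never came."""
    slow = 0
    for events in sessions:
        first = time_to_first_action(events, meaningful)
        if first is None or first > SLOW_START_SECONDS:
            slow += 1
    return slow / len(sessions)
```

Menu navigation is deliberately left out of the meaningful set in the usual case, since a player can spend a minute in menus without having done anything in the game itself.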

Avoiding Confirmation Bias

The biggest threat to useful playtest data is your own expectations.

Pre-Test Protocol

Before the session:

Write down your predictions — what do you expect to happen?
Define "surprising" outcomes — what would change your mind?
Assign a skeptic — one team member whose job is to challenge interpretations
Pre-commit to sample size — decide how many sessions before drawing conclusions (minimum 5 for qualitative, 30+ for quantitative)

Post-Test Protocol

After the session:

Review predictions vs. reality — where were you wrong? Those are the insights.
Separate observation from interpretation — "Player hesitated for 8 seconds at the door" (observation) vs. "Player didn't understand the door mechanic" (interpretation)
Look for disconfirming evidence — actively search for data that contradicts your preferred narrative
Quantify before concluding — "it felt like everyone struggled" vs. "3 of 7 players failed this section"
Delay solutions — understand the problem fully before proposing fixes
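The pre-test and post-test steps fit together if predictions are written down in a form that can be checked against counts afterwards. The sketch below is one possible shape for that, with every name (`Prediction`, `review`, the `expected_failures_max` field) invented for illustration. It reports each result as a quantified statement and marks the predictions that turned out wrong.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Prediction:
    """A claim recorded before the sessions run (hypothetical structure)."""
    claim: str
    section: str
    expected_failures_max: int  # most players we expect to fail this section


def review(predictions: list[Prediction],
           failures_by_section: dict[str, int],
           sessions_run: int) -> list[tuple[str, str, bool]]:
    """Return (claim, quantified observation, prediction_was_wrong) per prediction."""
    report = []
    for p in predictions:
        failed = failures_by_section.get(p.section, 0)
        observation = f"{failed} of {sessions_run} players failed {p.section}"
        report.append((p.claim, observation, failed > p.expected_failures_max))
    return report
```

The output keeps the observation ("3 of 7 players failed door") separate from any explanation of why, which is left for the team to argue about after the numbers are on the table.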

Common Bias Traps

| Trap | Mechanism | Counter |
|------|-----------|---------|
| Anchoring | First session dominates your impression | Review all sessions before concluding |
| Availability | Dramatic moments overshadow quiet ones | Use metrics, not memory |
| Projection | Attributing your own experience to players | Watch what they do, not what you'd do |
| Sunk cost | Defending features you spent time on | Ask "would we add this today?" not "should we cut this?" |
| Survivorship | Only hearing from players who stayed | Track quit points with equal priority |

Survey Design

Good Questions (Experience-Focused)

"How would you describe the experience in one word?"
"What moment stands out most?" (Then probe: "What made it stand out?")
"Was there a point where you wanted to stop? What was happening?"
"What would you do differently on a second playthrough?"
"Rate how [specific emotion] you felt during [specific moment]" (1-5 scale)

Bad Questions (Leading or Attributive)

"Did you find the controls intuitive?" (Leading — assumes controls are the issue)
"What would you change?" (Too broad — gets surface-level answers)
"Did you like it?" (Binary, social pressure toward "yes")
"Was it too hard?" (Leading — frames difficulty as the variable)
"What features would you add?" (Players aren't designers; this generates noise)

The One-Question Shortcut

If you can only ask one question: "Tell me about a moment that stood out — good or bad."

Then follow up with: "What were you trying to do?" and "What happened next?"

Interpreting Data

Decision Framework

| Signal | Confidence | Action |
|--------|------------|--------|
| Metrics + observation + self-report all agree | High | Act on it |
| Metrics show it, observation confirms, self-report disagrees | Moderate-High | Trust behavior over self-report |
| Self-report says it, but metrics/observation don't show it | Low | Investigate further; the report may point to a different real problem |
| Single session shows it, others don't | Very Low | Note it but don't act; one data point isn't a pattern |
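The table above can be written down as a small lookup so that a team applies it the same way every time. This is only a sketch of the four rows: the function name, the boolean inputs, and the decision to return `None` for combinations the table does not cover are assumptions.

```python
from enum import Enum
from typing import Optional


class Confidence(Enum):
    HIGH = "High"
    MODERATE_HIGH = "Moderate-High"
    LOW = "Low"
    VERY_LOW = "Very Low"


def classify(metrics: bool, observation: bool, self_report: bool,
             sessions_showing: int) -> Optional[tuple[Confidence, str]]:
    """Map which evidence sources show a problem to a confidence level and action."""
    if sessions_showing <= 1:
        return Confidence.VERY_LOW, "Note it but don't act"
    if metrics and observation and self_report:
        return Confidence.HIGH, "Act on it"
    if metrics and observation:
        return Confidence.MODERATE_HIGH, "Trust behavior over self-report"
    if self_report and not metrics and not observation:
        return Confidence.LOW, "Investigate further"
    return None  # combination not covered by the table; needs a judgment call
```

The single-session check runs first because a finding seen once should not be promoted no matter how many evidence types agree within that one session.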

Sample Size Guidance

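5-8 sessions: usually enough to surface most major usability problems (the often-cited figure is ~85%; the real share depends on how easy each problem is to run into)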
15-20 — identifies behavioral patterns
30+ — minimum for quantitative conclusions (win rates, balance)
A/B tests — require statistical power calculation; varies by effect size
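Two pieces of arithmetic sit behind the guidance above. The ~85% figure is usually derived from the model found = 1 - (1 - p)^n, where p is the chance that a single tester hits a given problem. The 0.31 default below is the commonly quoted average from usability research and should be treated as an assumption for any specific game. The A/B calculation is the standard normal-approximation sample size for comparing two proportions. Function names and default alpha/power values are illustrative.

```python
from math import ceil
from statistics import NormalDist


def problems_found_fraction(sessions: int, hit_probability: float = 0.31) -> float:
    """Expected share of problems seen at least once after the given number of sessions."""
    return 1 - (1 - hit_probability) ** sessions


def ab_sample_size_per_arm(rate_a: float, rate_b: float,
                           alpha: float = 0.05, power: float = 0.80) -> int:
    """Players needed in each variant to detect rate_a vs rate_b (two-sided test)."""
    if rate_a == rate_b:
        raise ValueError("rates must differ; the effect size is zero")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    variance = rate_a * (1 - rate_a) + rate_b * (1 - rate_b)
    return ceil((z_alpha + z_power) ** 2 * variance / (rate_a - rate_b) ** 2)
```

Under these assumptions `problems_found_fraction(5)` comes out at roughly 0.84. Detecting a shift from a 50% to a 55% completion rate needs on the order of 1,500 players per variant, which is why small in-person playtests are poorly suited to settling A/B questions.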

Solo Developer Validation

When you're building alone, you can't run traditional playtests during development. These techniques bridge the gap:

Self-Testing Techniques

| Technique | How | What It Catches |
|-----------|-----|-----------------|
| The 2-week break | Play your own game after not touching it for 2 weeks | UX failures, forgotten controls, unclear objectives |
| The mute test | Play with sound off | Audio-dependent information, missing visual feedback |
| The squint test | Squint at the screen or reduce resolution | Visual clarity, contrast, UI readability |
| The record-and-review | Record gameplay, watch it the next day | Pacing problems, dead time, repetitive patterns |
| The explain test | Explain what you're doing out loud while playing | Logic gaps, unjustified assumptions, unclear goals |
| The wrong-hand test | Play with your non-dominant hand | Input complexity, timing windows, control accessibility |

Recruiting First Testers

When you're ready for external eyes (earlier than you think):

Friends/family who DON'T play games — best for UX/clarity testing
Friends who play games in your genre — best for feel/depth testing
Online communities (itch.io, indie forums, Discord) — best for unbiased feedback
Start with 3 testers — even 3 external players reveal more than 100 hours of self-play

Solo Metrics

If you're a solo developer shipping updates:

Track your own play session length (are YOU getting bored?)
Count deaths/failures per section (is difficulty spiking where you don't intend?)
Time each section (is pacing matching your design?)
Screenshot every moment of confusion or frustration — these are your UX bugs
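The section timing and failure counts above can be collected with a few lines of logging called from your own game or from a notes script. The class below is a hypothetical sketch, not tied to any engine: `SoloLog` and its method names are invented here, and the injectable `clock` exists only so the logger can be exercised without waiting in real time.

```python
import time
from collections import defaultdict
from typing import Callable, Optional


class SoloLog:
    """Accumulates time spent and failures per section for one play session."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._current: Optional[str] = None
        self._started = 0.0
        self.section_seconds: dict[str, float] = defaultdict(float)
        self.failures: dict[str, int] = defaultdict(int)

    def enter(self, section: str) -> None:
        """Close the current section, if any, and start timing a new one."""
        self.leave()
        self._current, self._started = section, self._clock()

    def leave(self) -> None:
        """Stop timing the current section and add the elapsed time to its total."""
        if self._current is not None:
            self.section_seconds[self._current] += self._clock() - self._started
            self._current = None

    def fail(self) -> None:
        """Record one death or failure in the current section."""
        if self._current is not None:
            self.failures[self._current] += 1
```

Comparing `section_seconds` against your designed pacing after each session gives the same 2x warning signal used for external playtests, with the caveat that your own times will be much faster than a new player's.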

Cross-References

game-design — Playtest scenarios from the 5-Component Framework (new player, stress, skill, abuse, readability tests)
systems-design — System health metrics (behavioral diversity, archetype formation) measured through playtesting
player-ux — The cognitive pillars (perception/attention/memory) drive the question generation framework
game-balance — Metrics-driven iteration for detecting and resolving balance problems
economy-design — Economy health monitoring metrics to track during playtests
experience-design — Testing whether the intended experience matches actual player experience
motivation-design — Testing retention and motivation through session length and return rate
encounter-design — Testing spatial readability and encounter fairness
narrative-design — Testing narrative comprehension and engagement
game-feel — "Does this feel good?" requires observation, not surveys
