Progressive Estimation

Estimate AI-assisted and hybrid human+agent work with progressive disclosure. Research-backed formulas with PERT statistics, confidence bands, and calibration feedback loops.

Trigger

Use when a user wants time/effort estimates for:

AI-assisted development tasks
Mixed human+agent workflows
Batch estimation of backlog items
Staffing or rollout planning with agents

Keywords: estimate, how long, effort, sizing, story points, rounds, agent time

Interaction

Use the AskUserQuestion tool, if it is available in your environment, for ALL user interactions: mode selection (Phase 0), intake questions (Phase 1), batch confirmation ("adjust any tasks?"), and tracker selection (Phase 4). The tool presents structured dropdowns instead of free-text back-and-forth. Ask one question at a time (or group up to 4 related questions per call) and wait for the response before proceeding. If the tool is not available, fall back to conversational questions in your text output. Instant mode skips the Phase 0 and Phase 1 questions but still uses the tool for Phase 4 tracker selection.

Workflow

Phase 0: Mode Selection

Ask two questions upfront (one at a time):

Control: "How much control do you want?"
- Instant — zero questions, infer everything from the task description, output immediately
- Quick — 4 questions with sensible defaults
- Detailed — 13 questions, full control over every parameter
Scope: "Single task or batch?"

This produces six paths:

Instant + Single → infer all parameters, output immediately
Instant + Batch → accept list, infer per task, output summary table
Quick + Single → ~4 questions then output
Quick + Batch → accept list, apply defaults to all, output summary table
Detailed + Single → full questionnaire (13 questions), rich output
Detailed + Batch → full questionnaire for shared parameters, per-task overrides

Instant Mode

When the user selects Instant mode:

Infer complexity, task_type, and risk_level from the task description
Apply all defaults: 1 human, 1 agent, partial maturity, risk 1.3, integration 15%, human fix 20%, standard review, domain familiarity 1.0, confidence 80%, ready phase, solo-startup org
Skip directly to Phase 3 (Estimation) and Phase 4 (Output)
After output, suggest: "Want to refine? Say 'quick' for 4 questions or 'detailed' for full control."

Phase 1: Intake

Load references/questionnaire.md for the appropriate path.

Quick path asks only:

What's the task? (or paste your list)
What type of work? (coding / bug-fix / investigation / testing / infrastructure / data-migration / design)
How many humans and how many agents?
Automation maturity: exploratory / partial / mostly-automated?

If task type is obvious from the description, auto-assign it and note it.

Detailed path adds: risk factors, integration complexity, domain familiarity, review depth, human fix ratio, confidence level, definition phase, organization context, and per-task dependency mapping.

Every question feeds a specific formula variable — see the mapping table in questionnaire.md for the complete wiring.

Phase 1.5: Cooperation Mode Detection (Automatic)

Auto-detect the team's cooperation mode from intake answers:

if num_agents == 0 → Human-only mode
if num_agents > 0 AND maturity in [exploratory, partial] → Hybrid mode
if num_agents > 0 AND maturity == mostly-automated → Agent-first mode
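The three rules above can be sketched as a small function. This is a minimal illustration, not the skill's own implementation: the function name, the lowercase mode strings, and the error handling for unknown inputs are assumptions made for the example.

```python
def detect_cooperation_mode(num_agents: int, maturity: str) -> str:
    """Return the cooperation mode implied by the intake answers.

    The name and the returned strings are hypothetical; only the
    three branching rules come from the text above.
    """
    if num_agents < 0:
        raise ValueError("num_agents must be non-negative")
    if num_agents == 0:
        # No agents at all: maturity is irrelevant.
        return "human-only"
    if maturity in ("exploratory", "partial"):
        return "hybrid"
    if maturity == "mostly-automated":
        return "agent-first"
    # Assumption: an unrecognized maturity level is an input error.
    raise ValueError(f"unknown maturity level: {maturity!r}")


if __name__ == "__main__":
    print(detect_cooperation_mode(1, "partial"))
```

With the Instant-mode defaults (1 agent, partial maturity), this sketch resolves to Hybrid mode.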

Announce the detected mode before proceeding:

| Mode | Announcement | Points Approach |
|------|-------------|----------------|
| Human-only | "Detected: human-only team. Using standard estimation with story points." | Points for sizing and velocity |
| Hybrid | "Detected: hybrid team. Using dual-track estimation — points for sizing, hours for planning." | Dual-track: points + hours |
| Agent-first | "Detected: agent-first team. Planning by human review capacity in hours." | Hours only; points optional for rough sizing |

This mode affects output format (Phase 4) and calibration recommendations (Phase 5).

Phase 2: Framework Selection

Load references/frameworks.md to select the right model:

| Scenario | Framework |
|----------|-----------|
| Single agent task | Round-based |
| Multi-agent project | Module/wave |
| Backlog import or rough sizing | S-M-L conversion |

For batch input, apply the selected framework per task, then roll up with dependency sequencing.

Phase 3: Estimation

Load references/formulas.md and compute estimates using the canonical formulas. Claude performs the arithmetic inline — no external script needed.

The computation pipeline:

Base rounds × risk × domain familiarity → agent rounds
Agent rounds × minutes per round → agent time
Integration, human fix (agent-effectiveness-adjusted), review, planning
Apply org size overhead to human time
Apply task type multiplier to subtotal
Apply cone of uncertainty spread to widen/narrow range
Compute PERT expected value and standard deviation
Apply confidence multiplier for committed estimate
Compute token & cost estimates (Step 15)
Check anti-pattern guards and generate warnings
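The first two pipeline steps can be sketched as plain arithmetic. This is only an illustration under stated assumptions: the canonical formulas live in formulas.md, the function names are hypothetical, and the example values for base rounds and minutes per round are made up. The defaults for risk (1.3) and domain familiarity (1.0) are the Instant-mode defaults listed earlier.

```python
def agent_rounds(base_rounds: float, risk: float = 1.3,
                 domain_familiarity: float = 1.0) -> float:
    """Step 1: base rounds x risk x domain familiarity -> agent rounds."""
    return base_rounds * risk * domain_familiarity


def agent_time_minutes(rounds: float, minutes_per_round: float) -> float:
    """Step 2: agent rounds x minutes per round -> agent time."""
    return rounds * minutes_per_round


if __name__ == "__main__":
    # Hypothetical inputs: 10 base rounds at 3 minutes per round.
    rounds = agent_rounds(10)
    print(rounds, agent_time_minutes(rounds, 3))
```

The remaining steps (integration, human fix, review, overhead, multipliers, and the uncertainty spread) depend on values defined in formulas.md and are deliberately left out of this sketch.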

If the user requests a standalone deterministic calculator, generate one from formulas.md in their preferred language. The generated script must:

Accept inputs via CLI args or stdin
Output JSON with all canonical fields
Have zero external dependencies
Be a single self-contained file

Phase 3.5: Small Council Validation (Automatic)

Three subagent perspectives review the estimate before output:

Optimist: Best-case analyst. What if everything goes right? Looks for parallelizable work, reducible scope.
Skeptic: Risk analyst. What's missing? Hidden dependencies, integration surprises, underestimated review.
Historian: Calibration analyst. How do similar past tasks compare? Reference stories, velocity data.

Trigger rules based on complexity:

| Complexity | Council Members | Rationale |
|------------|----------------|-----------|
| S | None | Overhead exceeds value |
| M | Skeptic only | 1 subagent |
| L | Skeptic + Historian | 2 subagents |
| XL | Full council (all 3) | All subagents |

Each agent gives 2-3 sentences. Output a consensus estimate that weighs all perspectives. Flag disagreements where agents differ by >20%.
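The trigger rules and the disagreement flag can be sketched as a lookup plus a spread check. The names below are hypothetical. The text does not say what the 20% is measured against, so this sketch assumes it is the gap between the highest and lowest council estimate, relative to the lowest.

```python
from typing import Dict, List

# Council membership per complexity size, as given in the trigger rules.
COUNCIL_BY_COMPLEXITY: Dict[str, List[str]] = {
    "S": [],
    "M": ["Skeptic"],
    "L": ["Skeptic", "Historian"],
    "XL": ["Optimist", "Skeptic", "Historian"],
}


def council_members(complexity: str) -> List[str]:
    """Return the council members to consult for a given size."""
    return list(COUNCIL_BY_COMPLEXITY[complexity])


def has_disagreement(estimates_hours: Dict[str, float],
                     threshold: float = 0.20) -> bool:
    """Flag when council estimates differ by more than the threshold.

    Assumption: spread is (highest - lowest) / lowest.
    """
    values = list(estimates_hours.values())
    if len(values) < 2:
        return False  # a single perspective cannot disagree with itself
    lowest, highest = min(values), max(values)
    if lowest <= 0:
        raise ValueError("estimates must be positive")
    return (highest - lowest) / lowest > threshold


if __name__ == "__main__":
    print(council_members("L"))
    print(has_disagreement({"Skeptic": 5.0, "Historian": 4.0}))
```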

Note: Use the Agent tool to spawn council members as subagents if available. If not available, simulate the perspectives inline.

Phase 4: Output

Load references/output-schema.md for formatting.

Always lead with the one-line summary, then PERT block, then expand:

Expected: ~4 hrs | Committed (80%): ~5.5 hrs | 10-26 agent rounds + 3 hrs human | Risk: medium | Size: M

Then provide:

PERT expected value with confidence bands (68%, 95%)
Full breakdown table
Anti-pattern warnings (if any triggered)
Tracker-formatted output (if requested)

Ask which tracker and mode:

Tracker: Linear, JIRA, ClickUp, GitHub Issues, Monday, GitLab, Asana, Azure DevOps, Zenhub, Shortcut, or generic
Mode: Native fields or embedded in description (default: embedded)

For batch output, produce a summary table first, then rollup, then warnings, then per-task details on request.

Phase 5: Calibration (Optional)

Load references/calibration.md when the user wants to log actuals or tune their estimation ratios.

Key calibration features:

PRED(25) tracking — target 75% of estimates within 25% of actuals
Reference story library — concrete examples per size per task type
Bias detection — systematic over/under estimation identification
Team profiles — separate calibration per team
Convergence tracking — teams reach accuracy within 5-10 cycles
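PRED(25) can be computed from logged estimate/actual pairs. The sketch below assumes the usual definition, where an estimate counts as a hit when its absolute error is at most 25% of the actual value. The function name and the tuple layout are hypothetical; calibration.md holds the skill's own procedure.

```python
from typing import Iterable, Tuple


def pred(pairs: Iterable[Tuple[float, float]], threshold: float = 0.25) -> float:
    """Fraction of (estimate, actual) pairs whose relative error is
    within the threshold, measured against the actual value."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one estimate/actual pair")
    hits = sum(
        1 for estimate, actual in pairs
        if abs(actual - estimate) / actual <= threshold
    )
    return hits / len(pairs)


if __name__ == "__main__":
    # Hypothetical history in hours: (estimate, actual).
    history = [(4, 5), (10, 10), (3, 6), (8, 7)]
    score = pred(history)
    print(f"PRED(25) = {score:.0%}, target met: {score >= 0.75}")
```

In this made-up history three of the four estimates land within 25% of the actual, so the sketch reports 75%, which is exactly the stated target.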

Re-estimation Triggers

Suggest re-estimation when:

Scope changes after initial estimate
A task is blocked by an unresolved dependency
Midpoint check reveals estimate drift >30%
Team composition changes (humans or agents added/removed)
Automation maturity level shifts during the project
Definition phase advances (narrower cone of uncertainty)

Loading Phases

| Phase | Files Loaded | When |
|-------|-------------|------|
| 0-1 | questionnaire.md | Always (intake); skipped in Instant mode |
| 1.5 | (no files) | Automatic after intake |
| 2 | frameworks.md | After intake |
| 3 | formulas.md | During computation |
| 3.5 | calibration.md (reference stories) | Automatic for M+ tasks (council validation) |
| 4 | output-schema.md | During output |
| 5 | calibration.md | On request |

Key Concepts

Agent Effectiveness Decay

Based on METR research (24k runs, 228 tasks): AI agents excel at small tasks (~90% effectiveness) but effectiveness drops to ~30% for XL tasks. The skill automatically increases human effort allocation for larger tasks.

Log-Normal Three-Point Estimation

Every estimate produces a weighted expected value using a log-normal weighting: (min + 4×geometric_mean + max) / 6 with standard deviation (max - min) / 6. Deep validation (KS test, n=84k) showed log-normal fits actual software effort distributions better than PERT-beta in all size bands.
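As a worked sketch of the formula above: the text does not define `geometric_mean`, so this example assumes it is the geometric mean of the min and max estimates, `sqrt(min × max)`. The function name is hypothetical, and formulas.md remains the canonical definition.

```python
import math
from typing import Tuple


def three_point(min_hours: float, max_hours: float) -> Tuple[float, float]:
    """Return (expected, std_dev) using the log-normal weighting.

    Assumption: geometric_mean = sqrt(min * max).
    """
    if min_hours <= 0 or max_hours < min_hours:
        raise ValueError("require 0 < min_hours <= max_hours")
    geometric_mean = math.sqrt(min_hours * max_hours)
    expected = (min_hours + 4 * geometric_mean + max_hours) / 6
    std_dev = (max_hours - min_hours) / 6
    return expected, std_dev


if __name__ == "__main__":
    expected, std_dev = three_point(2, 8)
    print(f"expected ~{expected:.2f} h, std dev {std_dev:.2f} h")
```

For a 2 to 8 hour range the geometric mean is 4, which sits below the arithmetic midpoint of 5. The expected value is therefore pulled toward the low end, as a right-skewed effort distribution would suggest.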

Confidence Levels

Size-dependent multipliers derived from 84k estimate-actual pairs. Small tasks need larger buffers due to wider actual/estimate spreads:

50% = stretch goal (raw expected value)
80% = likely delivery (1.4–1.8x depending on size, default for quick path)
90% = safe commitment for external deadlines (2.0–2.9x depending on size)

Cone of Uncertainty

Early-phase estimates have wider ranges. The skill widens min/max spread based on how well-defined the work is, preventing false precision.

Anti-Pattern Guards

The skill warns against common estimation mistakes: oversized tasks, false precision, estimates-as-commitments, and point-to-hour conversions.

Skill Type

Encoded Preference Skill — sequences a workflow Claude can already perform, following a specific estimation process with research-backed formulas, PERT statistics, and calibration feedback loops.
