agentic/code/addons/agent-loop/skills/infer-completion-criteria/SKILL.md
Infer measurable completion criteria for an agent-loop task from project docs, code, and AIWG standards when the user has not supplied --completion explicitly
npx skillsauth add jmagly/aiwg infer-completion-criteriaInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
When a user starts an agent-loop task without supplying --completion, this skill derives a measurable, verifiable completion criterion from project state. The output must satisfy the vague-discretion rule: a concrete shell command or file-inspection check that returns pass/fail unambiguously.
Iteration is only as good as its gate. A loop with a vague gate ("until it's done") runs forever or exits prematurely. This skill is what turns "agent-loop this" into "agent-loop this until <measurable thing>."
The canonical name for the iterative-loop addon is agent-loop. ralph is the legacy name for the executor skill, retained as an alias; al is a short form. The detection/routing skill is agent-loop (which delegates to this skill when criteria are missing); the executor is ralph (canonical name forthcoming). Everywhere this skill says "agent-loop" you can read "ralph" as the legacy equivalent.
This skill is invoked by:
agent-loop detection-and-routing skill when it parses a user request without explicit completion criteriaralph executor skill during Phase 1 initialization when --completion is omittedagent-loop-ext external-loop launcher during pre-launch resolution when --completion is omittedaiwg discover "infer completion" → aiwg show skill infer-completion-criteria when a user wants to preview the inferred criterion before committing to a loopThis skill does not run when --completion is explicit. The user's word is authoritative.
The skill is a deterministic walk through five evidence layers, plus one synthesis step. Each layer contributes candidate criteria; the synthesis picks the strongest measurable one and explains the chain of evidence.
Parse the user's task description for an intent verb. Map to a default criterion class:
| Verb / phrase | Criterion class | |---|---| | "fix tests", "make tests pass", "test failure" | Test suite passes (exit 0) | | "add tests", "increase coverage", "test coverage" | Coverage threshold met | | "fix types", "type errors", "migrate to typescript" | Type checker exits 0 | | "fix lint", "clean up warnings", "style" | Linter exits 0 | | "build", "make it compile" | Build command exits 0 | | "refactor", "extract", "rename" | Tests still pass AND build still passes (regression gate) | | "implement <X>", "add feature <X>" | Tests for the new code exist and pass | | "document", "add docs", "JSDoc" | Coverage check on docstrings/JSDoc presence | | "fix bug", "resolve issue #N" | Specific test for that bug passes AND existing suite still green | | "migrate", "upgrade" | Build + test + lint all green (no regression) |
If the verb is ambiguous, the skill falls back to "regression gate" (build + test + lint all green) as the safest default.
Read the project's context files. AIWG-managed projects often declare commands directly:
# Run tests
npm test
# Type check
npx tsc --noEmit
# Lint markdown
npm exec markdownlint-cli2 "**/*.md"
Extract these as the canonical commands for their respective domains. The Development section of CLAUDE.md is the highest-trust source here — it's what the project's maintainers run.
Also scan for explicit completion-criterion conventions. Some projects state "a commit is not finished until CI passes" — that signals the CI command (or equivalent local invocation) is the gate.
Inspect the project's manifest files to discover scripts and tools:
| Manifest | Where to look |
|---|---|
| package.json | scripts.test, scripts.lint, scripts.build, scripts.coverage, scripts.typecheck |
| Cargo.toml | implies cargo test, cargo build, cargo clippy |
| pyproject.toml | [tool.pytest], [tool.ruff], [tool.mypy], scripts.* |
| go.mod | implies go test ./..., go vet ./..., go build ./... |
| Gemfile | implies bundle exec rspec, bundle exec rubocop |
| pom.xml / build.gradle | mvn test, mvn verify, gradle test |
| .tool-versions / mise.toml | language version pins inform which tool is canonical |
When multiple scripts exist (e.g. test, test:unit, test:integration), prefer the script the project's own docs reference. If the docs don't reference any, prefer the most specific match to the task verb (e.g. for "fix integration test" → test:integration).
CI files encode the team's actual definition of "passes":
| CI system | Scan |
|---|---|
| GitHub Actions | .github/workflows/*.yml — extract run: steps from non-deploy jobs |
| Gitea Actions | .gitea/workflows/*.yml — same |
| GitLab CI | .gitlab-ci.yml — extract script: from test/lint jobs |
| CircleCI | .circleci/config.yml |
| Jenkins | Jenkinsfile |
The first non-trivial verification step in the primary workflow is the team's canonical "done" gate. If CI runs npm test && npm run lint && npm run typecheck in order, the inferred criterion is "all three exit 0."
If the project has a .aiwg/ directory, scan for relevant context:
.aiwg/testing/test-strategy.md — declared verification approach.aiwg/architecture/software-architecture-doc.md — architectural quality gates.aiwg/security/security-gates.md — security-related criteria.aiwg/quality/code-review-guide.md — code quality bars.aiwg/activity.log — recent operations that may indicate what "done" looked like for similar past tasks.aiwg/working/<related-progress-files>.md — prior task progress files; mine the "Completion criteria" sectionsIf the project has a related use case (.aiwg/requirements/UC-*.md) whose ID is in the task description, pull that use case's acceptance criteria — those ARE the completion criteria.
Combine the layers into a single proposed criterion. The decision logic:
.aiwg/requirements/ matches the task, use its acceptance criteria verbatim. Done.command-A passes AND command-B passes AND command-C passes.<file or change exists in git diff against HEAD~1> — pure structural checkValidate the proposal against the vague-discretion rule:
max-iterations cap (ralph's default 10 is the floor; very large refactors may need 20).If the proposal fails any of these, regenerate. If after two regenerations the proposal still fails, surface the problem to the user with the diagnostic ("could not find a measurable verification command — please supply one explicitly").
The skill emits a single block of structured output for the calling skill (agent-loop router, ralph executor, or agent-loop-ext launcher) to consume:
proposed_completion:
criterion: "npm test passes AND npx tsc --noEmit exits 0"
verification_command: "npm test && npx tsc --noEmit"
rationale:
- "Task verb 'refactor' triggers regression gate (Layer 1)"
- "package.json scripts.test = 'jest --coverage' (Layer 3)"
- "CLAUDE.md Development section references both npm test and npx tsc --noEmit (Layer 2)"
- ".github/workflows/ci.yml runs both as required checks (Layer 4)"
confidence: high # high | medium | low
alternatives_considered:
- criterion: "npm run lint exits 0"
rejected_because: "Lint is not in CI required checks for this repo"
max_iterations_suggestion: 10
needs_human_confirmation: false # true if confidence == low OR criterion is unusual
When the skill runs in non-interactive mode (via aiwg al --auto-criteria or the equivalent on agent-loop / agent-loop-ext), needs_human_confirmation: false proceeds directly. Otherwise the consuming skill (the agent-loop router or the ralph / agent-loop-ext executor) shows the proposal to the user and confirms.
In interactive mode, after running the pipeline, present the proposal:
No --completion criteria was provided. I inferred:
Criterion: npm test passes AND npx tsc --noEmit exits 0
Verification: `npm test && npx tsc --noEmit`
Evidence:
- Task verb "refactor" → regression gate
- package.json scripts.test = jest --coverage
- CLAUDE.md Development section references both checks
- .github/workflows/ci.yml requires both
Proceed with this criterion? [Y/n/edit]
User options:
Y (default): start the loop with the inferred criterionn: abort and request the user supply --completion explicitlyedit: accept a manual edit to the criterion before proceedingUse the platform's native interaction tool when available (per native-ux-tools rule). On Claude Code, that means AskUserQuestion.
--completion is explicit → use the user's criterion, don't second-guess--no-infer-completion flag is passed → fail fast with a helpful error if --completion is also missingnpx tsc --noEmit pass") → extract the command from the task, don't re-infer| Case | Handling |
|---|---|
| No package.json, no manifest, no CI | Scan project root for any executable test runner (pytest, go test, cargo test, make test, Makefile target test). Fall back to the structural check if none found. |
| Multiple test commands (test:unit, test:integration, test:e2e) | Prefer the one nearest to the task scope. If task mentions "unit", use test:unit. If the task is broad, prefer the union via &&. |
| CI runs tests on multiple OS/Node versions | Use the local invocation (npm test), not the matrix runner. The matrix is a deploy concern. |
| Monorepo with multiple packages | Detect from pnpm-workspace.yaml / lerna.json / turbo.json / workspaces field. If the task scope is one package, infer that package's commands. If the task spans the monorepo, use the top-level test script. |
| Project has no tests at all | This is a finding. The inferred criterion should be "tests exist for the new code AND those tests pass." Surface to the user that the project lacks a baseline test suite — that's important context for the loop's expectations. |
| Project's tests are currently broken (the task IS to fix them) | Set the criterion to the passing condition. The whole point of the loop is to get from current red state to green. |
| Conflict between layers (CLAUDE.md says X, CI says Y) | Prefer CLAUDE.md (closer to the maintainer's intent). Note the discrepancy in the rationale. |
| Rule | How this skill respects it |
|---|---|
| vague-discretion | The whole point — produces measurable, command-form criteria |
| instruction-comprehension | Reads the task description carefully; doesn't override explicit user instructions |
| research-before-decision | Layer-walk IS research; doesn't propose criteria without evidence |
| human-authorization | Confirms with user before starting loop (unless --auto-criteria explicitly granted) |
| auto-compact-continue | The inferred criterion IS what the loop continues toward; no "should I keep working" prompts |
| cli-secondary | Uses aiwg discover to find related skills if the task verb is unusual |
$ agent-loop "fix the failing auth tests" # or: aiwg al / aiwg ralph (legacy)
Inferring completion criteria...
Layer 1 verb: "fix tests" → test-pass class
Layer 2 CLAUDE.md: "npm test" is the canonical test command
Layer 3 package.json: scripts.test = "jest"
Layer 4 CI: .github/workflows/ci.yml runs `npm test`
Layer 5: no related use case found
Proposed criterion: npm test passes (exit 0)
Verification: `npm test`
Confidence: high
Proceed? [Y/n/edit]
$ agent-loop "extract auth logic into a separate module"
Inferring completion criteria...
Layer 1 verb: "extract" → regression gate
Layer 2 CLAUDE.md: no Development section
Layer 3 package.json: scripts.test = "echo 'no tests'" (degenerate)
Layer 4 CI: no workflows found
Layer 5: no related use case found
Warning: project has no functional test suite or CI configuration.
Falling back to structural verification.
Proposed criterion: The new module exists, the original code references it,
AND `npx tsc --noEmit` still exits 0
Verification: `test -f src/auth/index.ts && grep -q 'from.*src/auth' src/main.ts && npx tsc --noEmit`
Confidence: low
This is a weak gate. Consider supplying --completion explicitly
or adding a test suite first.
Proceed? [Y/n/edit]
$ agent-loop "implement UC-AUTH-001"
Inferring completion criteria...
Layer 1 verb: "implement" → tests-exist class
Layer 5: found .aiwg/requirements/UC-AUTH-001-user-login.md
Using acceptance criteria from UC-AUTH-001:
- [ ] User can log in with valid email/password
- [ ] Invalid credentials show clear error message
- [ ] Account locks after 5 failed attempts
- [ ] Login completes within 2 seconds
Proposed criterion: All acceptance criteria from UC-AUTH-001 verified by tests,
AND `npm test -- --testPathPattern=auth` passes
Verification: `npm test -- --testPathPattern=auth`
Confidence: high (acceptance criteria are explicit)
Proceed? [Y/n/edit]
$ agent-loop "make the code better"
Inferring completion criteria...
Layer 1 verb: "make better" → AMBIGUOUS, no clear criterion class
Layer 5: no related use case
Cannot infer measurable criteria for this task.
"Make the code better" is vague (per AIWG vague-discretion rule).
A loop with no measurable gate runs forever or exits prematurely.
Please supply --completion with a concrete check, e.g.:
--completion "npm test passes AND npm run lint exits 0"
--completion "all functions in src/utils/ have JSDoc"
--completion "complexity score from eslint < 10 for all files"
Or rephrase the task with a concrete intent:
agent-loop "reduce cyclomatic complexity in src/utils/"
agent-loop "add JSDoc to all exported functions in src/api/"
--completion is missingralph is legacy for agent-loop's executor)data-ai
Report which research-corpus radar sidecars are overdue for refresh. Computes staleness (days since last refresh vs the cadence window) for every radar, sorted most-overdue-first. Runs via `aiwg corpus radar-status`.
data-ai
Aggregate research-corpus radar sidecars into a corpus or per-cluster freshness report — totals, overdue count, per-cluster / per-GRADE / per-trajectory breakdowns, an overdue table, and per-radar rationale snippets. Runs via `aiwg corpus radar-report`.
testing
Scaffold radar/freshness sidecars for research-corpus REFs. Pulls title/authors from the citation sidecar and GRADE from the analysis doc, defaults the refresh cadence from GRADE and the cluster from a corpus-local map, and stamps documentation/radar/REF-XXX-radar.md. Runs via `aiwg corpus radar-init`.
data-ai
Compute an entity's publication trajectory — per-year paper counts, topic drift, hot-streak detection (≥3 consecutive A-grade years), and career phase. Runs via `aiwg corpus profile-temporal`.