skills/context-engineering/SKILL.md
Analyze Liza `.liza/agent-prompts/` and `.liza/agent-outputs/` from a context-engineering perspective: prompt payload shape, context budget use, cacheability, duplicated or missing context, instruction hierarchy, tool-output pressure, role-specific context fit, and prompt-output feedback loops. Use when diagnosing agent context bloat, prompt drift, poor agent handoffs, repeated misunderstandings, excessive tool output, or whether Liza agents received the right information at the right time.
Analyze only .liza/agent-prompts/ and .liza/agent-outputs/ unless the user names other artifacts.
This skill is complementary to liza-logs: liza-logs finds operational failures and token/tool patterns; this skill explains whether the prompt and context design caused or amplified those patterns.
Run the corpus indexer first:
python3 skills/context-engineering/scripts/context-corpus-index.py .liza
Use the generated index as the primary source for mechanical discovery: inventory, prompt/output pairing, size and pressure signals, outcome signals, common tools, MCP usage, and sample selection. The index is not evidence of causality by itself.
The indexer supports both Claude rich stream-json logs and Codex sparse item.completed logs. Check the reported format counts before assuming which fields are available.
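When the indexer is unavailable, a rough format count can be approximated by hand. The sketch below is not the indexer's logic: it assumes each log line is one JSON object with a top-level `type` field, that Codex lines carry `item.completed`, and that Claude stream-json lines use the types listed in `CLAUDE_TYPES`. Both `CLAUDE_TYPES` and the function names are hypothetical and should be checked against real logs.

```python
import json
from collections import Counter

# Assumed top-level event types for Claude stream-json logs (verify against real logs).
CLAUDE_TYPES = {"system", "assistant", "user", "result"}


def detect_format(line):
    """Guess which log format a single JSONL line belongs to."""
    try:
        event = json.loads(line)
    except json.JSONDecodeError:
        return "unknown"
    if not isinstance(event, dict):
        return "unknown"
    kind = event.get("type")
    if kind == "item.completed":
        return "codex"
    if kind in CLAUDE_TYPES:
        return "claude"
    return "unknown"


def count_formats(lines):
    """Count detected formats over non-empty lines."""
    return Counter(detect_format(line) for line in lines if line.strip())
```

A large `unknown` count suggests the assumptions above do not match the corpus, so field availability should not be inferred from this sketch.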
Use indexer options deliberately:
python3 skills/context-engineering/scripts/context-corpus-index.py .liza --json
python3 skills/context-engineering/scripts/context-corpus-index.py .liza --max-pair-minutes 30
python3 skills/context-engineering/scripts/context-corpus-index.py .liza --sample-limit 25
- `--json` when exact pair metadata, token fields, or full metrics are needed.
- `--max-pair-minutes` to control how strict same-role timestamp pairing should be.
- `--sample-limit` to expand or shrink top lists and the sampling plan.

If a liza-logs report or analyzer output is available, use it as the first sampling guide. Prioritize roles, runs, or timestamps with repeated tool failures, broad tool-result volume, duplicated task-local material, growing prompts, low cache reuse for expected-stable prefixes, or blocked/rejected task outcomes.
If liza-logs and context-engineering evidence disagree, report the disagreement explicitly and keep the narrower claim supported by direct prompt/output evidence. Example: liza-logs may correctly flag token pressure while prompt shape is not the cause.
Before opening raw prompt/output files, use the index's Sampling Plan plus the sections relevant to the question: Largest Prompts, Largest Outputs, High Tool-Output Pressure, Outcome Signal Mentions, Role Distribution, Prompt Size Trends, Common Tools, MCP Usage, and Pairing.
Treat indexer outcome signals as text mentions until confirmed by structured state, blackboard, or source context. They are good sampling signals, not proof that a verdict or status transition occurred.
Use the script's pair confidence labels to guide evidence tier and confidence: exact-stem, within-5m, within-30m, within-2h, low-confidence, and no-pair. Report the pairing confidence or matching window for prompt-causality claims.
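As an illustration of how such labels could be assigned, here is a minimal sketch. The label names come from the list above, but the rule itself (stem equality first, then timestamp gap) and the function name `pair_confidence` are assumptions inferred from the label names, not the script's actual implementation.

```python
from datetime import datetime, timedelta
from pathlib import Path
from typing import Optional


def pair_confidence(
    prompt: Optional[Path],
    output: Optional[Path],
    prompt_time: Optional[datetime] = None,
    output_time: Optional[datetime] = None,
) -> str:
    """Hypothetical labeling rule; thresholds are inferred from the label names."""
    if prompt is None or output is None:
        return "no-pair"
    if prompt.stem == output.stem:
        return "exact-stem"
    if prompt_time is None or output_time is None:
        return "low-confidence"
    gap = abs(output_time - prompt_time)
    if gap <= timedelta(minutes=5):
        return "within-5m"
    if gap <= timedelta(minutes=30):
        return "within-30m"
    if gap <= timedelta(hours=2):
        return "within-2h"
    return "low-confidence"
```

Under this reading, a prompt-causality claim resting on a `within-2h` or `low-confidence` pair would warrant lower reported confidence than one resting on `exact-stem`.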
Classify evidence tier before making claims:
Fallback indexing when the script is unavailable:
- `wc -c` to identify largest prompts and outputs.

Use the index's largest-prompt, largest-output, token, cache, and pressure signals to choose prompts and outputs for deeper reading.
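Where a shell is inconvenient, the `wc -c` fallback can be approximated in Python. This is a minimal sketch: the helper name `largest_files` is hypothetical, and it reports byte sizes only, without any of the indexer's pairing or pressure signals.

```python
from pathlib import Path


def largest_files(directory, limit=10):
    """Return (size_in_bytes, path) pairs, largest first, similar to `wc -c | sort -rn`."""
    root = Path(directory)
    if not root.is_dir():
        return []
    sized = [(p.stat().st_size, p) for p in root.rglob("*") if p.is_file()]
    # Sort on size only so that ties never compare Path objects.
    return sorted(sized, key=lambda item: item[0], reverse=True)[:limit]


if __name__ == "__main__":
    for folder in (".liza/agent-prompts", ".liza/agent-outputs"):
        print(folder)
        for size, path in largest_files(folder):
            print(f"{size:>12}  {path}")
```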
Use the index's Role Distribution table for per-role prompt/output averages, output-to-prompt ratios, and prompt size trend classification (stable, growing, shrinking). Use Prompt Size Trends for chronological size progressions on roles with ≥10 prompts or non-stable trends.
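To make the trend labels concrete, the sketch below classifies a chronological list of prompt sizes. The rule used here (mean of the first third versus mean of the last third, with a 10% tolerance) is an assumption for illustration; the indexer's actual classification rule is not stated in this document and may differ.

```python
from statistics import mean


def classify_trend(sizes, tolerance=0.10):
    """Label chronological prompt sizes as stable, growing, or shrinking.

    Assumed rule: compare the mean of the first third with the mean of the
    last third; the tolerance value is hypothetical.
    """
    if len(sizes) < 3:
        return "stable"
    third = len(sizes) // 3
    early, late = mean(sizes[:third]), mean(sizes[-third:])
    if early == 0:
        return "growing" if late > 0 else "stable"
    change = (late - early) / early
    if change > tolerance:
        return "growing"
    if change < -tolerance:
        return "shrinking"
    return "stable"
```

A `growing` label on its own is only a sampling signal: the prompts at both ends of the series still need to be read to see which segment grew.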
Separate expected fixed cost from avoidable bloat before raising findings:
Treat these as structural pressure signals:
Classify prompt segments as:
| Segment | Cacheability question |
|---------|-----------------------|
| Stable prefix | Is this identical across runs so provider prompt caching can reuse it? |
| Semi-stable role context | Does it change only when role, contract, skill, or pipeline config changes? |
| Task-local context | Is it necessary for this run, or should it be referenced/deferred? |
| Volatile context | Does timestamped state, logs, or generated output appear earlier than needed? |
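One rough way to probe the stable-prefix question is to measure how much leading text a set of same-role prompts actually shares. The sketch below is an approximation under stated assumptions: it compares characters rather than provider tokens, and the helper names are hypothetical.

```python
import os


def shared_prefix_chars(prompts):
    """Length in characters of the longest prefix shared by every prompt."""
    if not prompts:
        return 0
    # os.path.commonprefix compares strings character by character.
    return len(os.path.commonprefix(list(prompts)))


def prefix_share(prompts):
    """Shared-prefix length as a fraction of the shortest prompt."""
    shortest = min((len(p) for p in prompts), default=0)
    return shared_prefix_chars(prompts) / shortest if shortest else 0.0
```

A very short shared prefix across prompts that are otherwise near-identical usually points at volatile content, such as a timestamp, placed ahead of stable material.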
Flag high-leverage cacheability issues:
Check salience order separately from cacheability. A stable prefix helps provider caching, but the agent must still find decision-relevant context quickly. Prefer rendered prompts where the usable task surface is easy to locate:
Flag packing failures where stable but low-salience material buries the task, where broad references appear before the current decision surface, or where large artifacts are embedded when a precise source pointer would let the agent load them on demand.
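A simple proxy for buried task content is how far into the rendered prompt the task surface first appears. The following is a minimal sketch; the function name and the marker string are hypothetical and depend on how the prompts under review actually label the task.

```python
def task_offset_fraction(prompt, marker):
    """Fraction of the prompt that precedes the first occurrence of marker.

    Returns None when the marker is absent or the prompt is empty.
    """
    index = prompt.find(marker)
    if index < 0 or not prompt:
        return None
    return index / len(prompt)
```

For example, `task_offset_fraction(text, "## Task")` returning a value near 1.0 means nearly all of the prompt precedes the task. That is a reason to read the prompt, not proof of a packing failure, since a long stable prefix may be intentional for caching.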
Actively raise these prompt-shape issues when supported by evidence:
For each sampled prompt, classify context into:
| Class | Question |
|-------|----------|
| Contract | Did the prompt include required mode, tool, and guardrail context? |
| Task | Is the concrete task unambiguous, bounded, and falsifiable? |
| Domain | Does the agent receive the specs, docs, skills, or state it needs? |
| Operational | Are commands, paths, worktree rules, validation requirements, and approval rules clear? |
| Noise | What content is duplicated, stale, irrelevant, or too broad for the role? |
Look for context-engineering failures:
For each sampled output, trace behavior back to context:
When outputs show failures, distinguish:
Use these heuristics:
Compare prompt-output chains across roles:
Flag handoff compression failures where important nuance disappears, and handoff bloat where downstream agents receive full upstream artifacts when a structured digest would suffice.
Adversarial pair overlap is not duplication. Liza's doer/reviewer pairs require both agents to receive the same context so the reviewer can independently verify the doer's work. High content overlap between paired roles (e.g., analyst + reviewer, coder + code-reviewer, writer + us-reviewer) is by design. Do not flag it as redundancy. The relevant questions for paired prompts are whether the shared context is too large, poorly ordered, or volatile — not whether it is duplicated.
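The pair exemption can be sketched as a guard in front of any overlap check. This is an illustration only: the pair list repeats the examples named above, the role identifiers are assumed to match how roles appear in the corpus, and the line-level Jaccard measure and 0.8 threshold are hypothetical choices.

```python
# Doer/reviewer pairs taken from the examples above; extend as needed.
DESIGNED_PAIRS = {
    frozenset({"analyst", "reviewer"}),
    frozenset({"coder", "code-reviewer"}),
    frozenset({"writer", "us-reviewer"}),
}


def line_overlap(a, b):
    """Jaccard similarity over the sets of non-empty, stripped lines."""
    lines_a = {line.strip() for line in a.splitlines() if line.strip()}
    lines_b = {line.strip() for line in b.splitlines() if line.strip()}
    if not lines_a or not lines_b:
        return 0.0
    return len(lines_a & lines_b) / len(lines_a | lines_b)


def is_redundancy_candidate(role_a, role_b, prompt_a, prompt_b, threshold=0.8):
    """Never treat designed doer/reviewer overlap as redundancy."""
    if frozenset({role_a, role_b}) in DESIGNED_PAIRS:
        return False
    return line_overlap(prompt_a, prompt_b) >= threshold
```

For designed pairs the overlap value is still useful as an input to the size, ordering, and volatility questions; it just should not produce a duplication finding.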
A good compressed handoff preserves:
For each proposed fix, identify the smallest source artifact likely to implement it:
- `internal/prompts/templates/`

Existing-context check: before recommending a change, verify whether the intended instruction or context already exists upstream but is not rendered, is rendered too late, or is buried by higher-volume context. This prevents recommending duplicate contract/spec/template text when the actual issue is rendering, ordering, routing, or salience.
Each finding's Fix must name:
- `.liza/agent-prompts/` or `.liza/agent-outputs/` sample

Do not present a fix-localization finding until the relevant prompt/output pair and source template, config, contract, guardrail, skill, spec, state field, or operational process has been inspected.
Produce findings in this format:
# Context Engineering Report
## Executive Summary
- Prompt corpus: [N prompts, size range, roles]
- Output corpus: [N outputs, size range, roles]
- Primary bottleneck: [one sentence]
When referring to a specific session in the Executive Summary or Findings, name
the concrete `.liza/agent-outputs/` log filename. If the claim depends on a
paired prompt, also name the concrete `.liza/agent-prompts/` filename.
## Findings
### P1/P2/P3: [Finding Title]
- **Evidence tier:** [output-only | prompt+output | prompt+output+source | state-supported]
- **Evidence:** [specific prompt/output files and concise observed fact]
- **Confidence:** [high/medium/low, based on prompt-output pairing quality and evidence completeness]
- **Context mechanism:** [why this context shape leads to the behavior]
- **Impact:** [cost, failure rate, review churn, blocked tasks, context pressure]
- **Fix:** [source artifact to change, expected rendered prompt/output difference]
- **Validation:** [future `.liza/agent-prompts/` or `.liza/agent-outputs/` sample that would prove the fix worked]
## Context Budget Opportunities
| Opportunity | Expected effect | Risk |
|-------------|-----------------|------|
| [dedupe/summarize/defer/load-on-demand] | [token or behavior impact] | [what could regress] |
## Non-Findings
Items checked that are acceptable. Include these to avoid repeated rediscovery.
Prioritize findings by observed effect, not theoretical prompt neatness.
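If findings are assembled programmatically, a small structure can keep each entry aligned with the format above. The sketch below is one possible shape, not part of the skill: the `Finding` class and its field names are hypothetical, while the rendered labels and evidence tiers are copied from the report template.

```python
from dataclasses import dataclass

EVIDENCE_TIERS = ("output-only", "prompt+output", "prompt+output+source", "state-supported")


@dataclass
class Finding:
    priority: str  # "P1", "P2", or "P3"
    title: str
    evidence_tier: str
    evidence: str
    confidence: str
    mechanism: str
    impact: str
    fix: str
    validation: str

    def render(self) -> str:
        """Render the finding as Markdown, rejecting unknown evidence tiers."""
        if self.evidence_tier not in EVIDENCE_TIERS:
            raise ValueError(f"unknown evidence tier: {self.evidence_tier}")
        return "\n".join([
            f"### {self.priority}: {self.title}",
            f"- **Evidence tier:** {self.evidence_tier}",
            f"- **Evidence:** {self.evidence}",
            f"- **Confidence:** {self.confidence}",
            f"- **Context mechanism:** {self.mechanism}",
            f"- **Impact:** {self.impact}",
            f"- **Fix:** {self.fix}",
            f"- **Validation:** {self.validation}",
        ])
```

Sorting a list of findings with `sorted(findings, key=lambda f: f.priority)` orders P1 before P2 and P3, though the priority itself should still come from observed effect.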
When subagent delegation is supported, consider delegating separable evidence-gathering slices:
Recommend delegation only when slices are separable, evidence can be passed as raw artifacts, and the main agent can integrate findings without duplicating all reads.