MJ-skill/codex-orchestration/SKILL.md
Claude Code + Codex CLI collaboration framework. Covers invocation, role assignment, efficiency defaults, cost routing, and dispatch patterns. Load when Claude Code delegates work to the `codex` CLI via Bash.
npx skillsauth add develata/deve-skills codex-orchestrationInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
This skill describes defaults and judgment, not enforced rules. Hard enforcement lives in the harness UserPromptSubmit hook (rules 2/3 there). Skill content guides; harness gates.
Default mechanism: Claude Code invokes the codex CLI through the Bash tool. (The mcp__codex__* MCP tools still exist as fallback but are not the default.)
# Prompt as argument
codex exec --full-auto -s read-only -C "$PWD" \
"Read code/foo.py:28-50 and explain the bar() function. Answer in <200 words."
# Prompt from stdin (preferred for any prompt > a few lines)
codex exec --full-auto -s read-only -C "$PWD" < /tmp/codex_task_a3f1.md
# Long-running: route via Bash run_in_background
codex exec --full-auto -s workspace-write -C "$PWD" \
-o /tmp/codex_result_a3f1.md < /tmp/codex_task_a3f1.md > /tmp/codex_log_a3f1.txt 2>&1 &
| Flag | Meaning | Default for analytical tasks |
|---|---|---|
| -s, --sandbox <mode> | read-only / workspace-write / danger-full-access | read-only for inspection, workspace-write for impl |
| --full-auto | low-friction auto execution (analog of MCP approval-policy=never) | recommended for trusted projects |
| -m, --model <slug> | model override | omit (inherits ~/.codex/config.toml) — see § Model selection |
| -c <key=value> | config override (TOML) | -c model_reasoning_effort=xhigh for hard reviews |
| -C, --cd <dir> | working directory | project root |
| -o, --output-last-message <file> | write final agent message to file (clean for parsing) | use for any background dispatch |
| --json | emit JSONL events to stdout | use when downstream parser needs structure |
| --ephemeral | do not persist session jsonl | only for throwaway probes; default is persist |
CODEX_HOME env var selects the account (per codex-account-switching skill):
# codex-main (default — ~/.codex)
codex exec --full-auto ...
# codex-b (~/.codex-b)
CODEX_HOME=~/.codex-b codex exec --full-auto ...
Default order when multiple accounts exist: main → b → others alphabetical. Switch when current account hits 429 / 5xx / quota / model-not-available / repeated ApprovalDenied. Once a session is created on an account, follow-ups (codex exec resume <session-id>) should stay on that account — sessions and quota are per-CODEX_HOME.
Exemptions to main-first (state inline at call site): parallel scoring batches that need account independence (§9 Stage-2), main quota >50% used → route non-critical to b, user-pinned account.
Principle: LLM cannot reliably report its own model slug or CLI version, and different CODEX_HOME accounts can carry different CLI minor versions simultaneously. The single authoritative source is the per-account session jsonl. Critical dual artifacts should verbatim-embed the helper's 5-line stdout in their header, citing session_path as the primary anchor.
Authoritative source per account:
$CODEX_HOME/sessions/YYYY/MM/DD/rollout-<timestamp>-<session-id>.jsonl (default ~/.codex if CODEX_HOME unset)Recording recipe (prefer the helper over hand-grep): run ~/.claude/skills/codex-orchestration/scripts/codex_session_meta.sh <session-id> and verbatim-embed the 5-line stdout (model= / cli_version= / effort= / session_path= / session_first_ts=) into the dual artifact header. The helper auto-discovers all ~/.codex* homes and uses pipefail-safe grep.
Manual fallback (if helper unavailable):
grep -oE '"model":"[^"]*"' <rollout.jsonl> | head -1
head -1 <rollout.jsonl> | python3 -c "import sys,json; print(json.loads(sys.stdin.read())['payload']['cli_version'])"
grep -oE '"reasoning_effort":"[a-z]+"' <rollout.jsonl> | head -1
Calibration example — what a compliant header looks like:
model=gpt-5.5
cli_version=0.125.0
effort=xhigh
session_path=~/.codex/sessions/<YYYY>/<MM>/<DD>/rollout-<session-id>.jsonl
session_first_ts=2026-04-29T17:29:49Z
Verbatim from helper stdout. session_path disambiguates which CODEX_HOME and which exact JSONL.
Counter-example — Codex's text reply self-reports GPT-5 codex-cli 0.125.0; jsonl shows 0.125.0-alpha.3 on one home and 0.125.0 on another. Transcribing the text reply silently loses the alpha.3 suffix and conflates the two accounts. The catch: model self-report is a hallucination surface; jsonl is ground truth.
Copy the matching boundary template from ~/.claude/skills/codex-orchestration/templates/:
| Boundary | Template |
|---|---|
| (a) plan-prior | (a)_plan_prior_dual_artifact_template.md |
| (b) decision-node | (b)_decision_node_dual_artifact_template.md |
| (c) post-impl | (c)_postimpl_dual_artifact_template.md |
Header annotation slot = first content block; paste helper stdout verbatim. Each template encodes its boundary's evidence requirements (e.g. (c) reviewer independently re-runs source-code trace + reproduces empirical numbers per harness rule 3(c)).
Default role mapping by task type:
| Task Type | Executor | Reviewer | Rationale | |---|---|---|---| | Implementation | Codex | Claude | Codex is cheaper for bounded execution; Claude has deeper context for correctness review. | | Analysis / research | Codex | Claude | Claude has stronger critical judgment for synthesis. | | Critical decisions | Both independently | Claude synthesizes | Avoids single-model blind spots. | | Mechanical / low-priority | Codex alone | None | Reviewer overhead not worth it. |
dispatch_codex.sh invocation requires a CronCreate probe per codex-dispatch-watchdog before the wrapper runs (project PreToolUse hook enforces; sentinel mtime ≤ 270s required with */3 * * * * cron, updated 2026-05-19). Inline codex exec "<short>" returning in seconds is exempt — stall is visible./tmp/codex_user_context_<ID>.md and Claude's instructions to /tmp/codex_task_<ID>.md, then codex exec ... < /tmp/codex_task_<ID>.md (or have the prompt instruct codex to read both files). Keeps Claude's conversation context small and separates raw user intent from Claude's framing. CLI has no MCP-style 8 KB transport limit, but disk handoff still preserves auditability and lets you cite the exact prompt later.--full-auto (or -c approval_policy=never) for trusted projects to avoid interactive prompts. Equivalent of MCP approval-policy=never. Without it, codex may prompt for shell commands and block autonomous execution.-s read-only for analysis/inspection tasks. Use workspace-write only when the task actually edits files.codex exec resume <session-id> for follow-up questions in the same domain — avoids cold-start.codex exec ... background jobs (each & in Bash, then wait), not sequential.Claude tokens are expensive, codex tokens are cheap. Route accordingly:
| Task Profile | Route To | Reason | |---|---|---| | High-stakes (architecture, tradeoff, review, final synthesis) | Claude directly | Quality matters more than cost. | | Low-priority mechanical (bulk inspection, formatting, docs, simple summaries) | Codex | Save Claude token budget. | | Bounded execution (implementation, testing, debugging) | Codex | Well-scoped, codex is sufficient. | | Open-ended planning, synthesis, multi-source integration | Claude | Requires deep reasoning. |
Principle: Default behavior is to omit -m so the call inherits the per-account ~/.codex/config.toml setting. Explicit -m only when (i) API-key account with non-ChatGPT-allowed slug, (ii) reproducing a historical session's model for continuity audit, or (iii) deliberate regression testing. Before passing any explicit -m, run the verification recipe.
Calibration example — a compliant analytical-task dispatch:
codex exec --full-auto -s read-only -C "$PWD" \
-o /tmp/codex_result_a3f1.md \
< /tmp/codex_task_a3f1.md
# no `-m` — inherits config.toml gpt-5.5
Verification trail (cite in artifact): grep '^model\s*=' ~/.codex/config.toml → model = "gpt-5.5". Account default matches project's dominant historical pattern.
Counter-example — copying from a tool-schema description without verifying:
codex exec -m gpt-5.2 ... # ← pulled from MCP schema example text
Account default is gpt-5.5; project's historical dual artifacts use gpt-5.5. Override silently routes to a different model than every prior dual, breaking cross-batch consistency. Catch: the verification recipe forces a grep against config.toml and historical artifacts before any explicit override — if the slug doesn't match either, abort.
Pre-dispatch verification recipe (artifact-grounded; before passing any explicit -m):
# 1. account default
grep -E '^model\s*=' ~/.codex/config.toml # codex-main
grep -E '^model\s*=' ~/.codex-b/config.toml # codex-b (if exists)
# 2. project-historical usage distribution
grep -rh '^model=' outputs/reports/*/run_*/dual* \
outputs/reports/*/run_*/'(a)'* \
outputs/reports/*/run_*/'(b)'* \
outputs/reports/*/run_*/'(c)'* 2>/dev/null \
| sort | uniq -c | sort -rn
# 3. effort distribution (same paths)
grep -rh '^effort=' <same paths as above> | sort | uniq -c | sort -rn
Pick the model that matches (i) account default OR (ii) project's dominant historical pattern. Cite the verification step in the dispatch artifact's "dispatch configuration" block.
gpt-5.5 supports low | medium | high | xhigh. Default in ~/.codex/config.toml is high. Override per-call:
codex exec --full-auto -c model_reasoning_effort=xhigh ...
Consider xhigh (extra cost + latency, ~2× reasoning_output_tokens):
Keep high (default):
xhigh once and the answer was clear.Verify actual effort applied: grep -oE '"effort":"[^"]*"' <rollout.jsonl> | head -1.
When delegating to codex, frame each task using this template:
Scope: [files, directories, or modules]
Goal: [what to accomplish]
Constraints: [what not to change, time/size limits]
Expected output: [format and content of the result]
Non-goals: [what this task explicitly does NOT cover]
dual-agent-original-request-review/SKILL.md § Verification Discipline. Codex tasks that touch the codebase or real artifacts inherit those defaults.When Claude cannot confidently determine all relevant file paths before dispatching codex. Common triggers:
Don't use when file paths are already known — go directly to standard Two-File handoff. Decision rule: if Claude can fill the Scope field of the Task Framing Template (§5) with specific paths, skip this protocol.
Round 1 — Broad discovery (codex searches, Claude evaluates)
Claude → codex: "Search <directory/module> for <keywords/patterns>.
List files with: path, one-line summary, relevance tier, mtime (if artifact).
Cap at 15 files."
codex → Claude: file list + relevance notes
Claude evaluates: assign relevance tiers, pick top 3-5 files for Round 2.
Claude may add new keywords discovered from codex's file list.
Round 2 — Narrowed inspection (standard Two-File handoff resumes)
Claude writes discovered paths into /tmp/codex_task_<ID>.md as the Scope field.
Claude → codex: "Read <specific files/line ranges>.
Answer <targeted analytical question>."
codex → Claude: detailed findings
Round 3 — Targeted verification (only if Round 2 reveals a gap)
Allowed only for: verifying a specific claim, chasing one newly discovered dependency.
NOT allowed for: introducing a new hypothesis or widening scope.
Claude → codex (via `codex exec resume <session-id>`):
"Read <newly discovered file> lines X-Y.
Verify whether <specific claim from Round 2>."
codex → Claude: verification result
Soft cap: 3 rounds. If relevant files still cannot be located after 3 rounds, escalate to user rather than expanding further.
| Tier | Meaning | Action | |---|---|---| | High | Directly implements or contains the target logic/data | Read in Round 2 | | Medium | Tangentially related (imports, configs, test files) | Read only if High files are insufficient | | Low | Unlikely relevant (naming coincidence, unrelated module) | Drop unless no better candidates exist |
Threshold: at least 2 High-tier files before proceeding to Round 2. If Round 1 yields 0 High files, refine keywords and retry Round 1 once (counts toward the 3-round cap).
codex exec resume <session-id> across all rounds — codex retains discovery context, reducing cold-start.artifact-grounded-review/SKILL.md § Artifact Staleness Check) applies. Codex should report artifact mtimes in Round 1 so Claude can filter stale ones before Round 2.Scope: <top-level directory or module — intentionally broad>
Goal: Find files related to <topic/keywords/patterns>.
For each file report: path | one-line summary | relevance (High/Med/Low) | mtime (artifacts only).
Constraints:
- Cap at 15 files
- Do not read file contents — list only
- Include mtime for result artifacts (JSON/CSV/NPZ) so staleness can be assessed
Expected output: Markdown table
Non-goals: Deep analysis — that comes in Round 2.
/) instead of a targeted directory.workspace-write sandbox for read-only tasks.Any new literature task: related-work writing, baseline exploration, method divergence, survey gap-filling.
Level 1 — Triage (relevance decision)
Per paper: abstract + introduction only.
Source priority: arXiv TeX > journal HTML/Markdown > local PDF first ~2 pages.
Codex output (one row per paper): go/kill, one-line reason, relevance tag (which project line).
Level 2 — Method reading (papers that pass triage)
Per paper: relevant section only (method / experiments), not full text.
Same source priority as Level 1.
Level 3 — Full paper (needs reproduction or deep citation)
Full load, but the same round must update
`docs/10_diagnosis/literature_*/literature_manifest.json`.
https://arxiv.org/e-print/<id> (arxiv.org in settings.local.json WebFetch allowlist).Minimum fields:
paper_id / title / year / download_urlproject_use (specific to claim / baseline / method)not_for (scenarios where extrapolation would be wrong)Locations (example layout):
docs/literature/<track>/literature_manifest.jsondocs/literature/fault_injection/literature_manifest.jsonartifact-grounded-review: any literature-based judgment cites specific paper:section; don't promote a codex one-line summary to a project claim.Principle: full PDFs dilute context and degrade subsequent decisions. Before reading any PDF or dispatching one to codex, answer three questions:
| Level | TeX / HTML / Markdown | PDF | |---|---|---| | 1. Triage | abstract + intro only | full PDF avoid; if PDF-only, extract abstract+intro as plain text first | | 2. Method reading | relevant section only | page-range excerpt only (e.g. pp. 3–7); avoid full PDF | | 3. Full paper | OK | OK, with manifest update in same round |
Recovery if violated: /compact the ingested PDF, re-feed at correct Level. The gate is Claude's responsibility on the main thread, not codex's.
Open-ended technical decisions: candidates unknown, multiple paths viable, need broad divergence + strict scoring. Typical:
Don't use for: executing already-decided tasks, bug fix, implementing known interfaces — those go through standard Two-File handoff (§3).
codex exec — not resume)Stage 1 — Divergent Generator (codex session A)
Input: raw user request + baseline context + constraints only.
No "prior attempts", no "previously rejected ideas", no Claude's own lean.
Ask: "Generate N=10 candidate approaches, each with: name, one-paragraph
rationale, key assumption, failure mode. Do not rank. Do not self-filter."
Output: 10 candidates, flat list.
Stage 2 — Strict Scorer (codex session B, isolated from A)
Input: raw user request + baseline context + ONE candidate at a time.
Session B sees no other candidates, no generator's self-assessment.
Ask: "Score this candidate on: novelty, feasibility, baseline-compatibility,
implementation complexity, expected gain. Give go / revise / kill plus
rationale. Cite specific code or artifact paths from baseline when
claiming compatibility."
Output: 10 independent score cards (parallel dispatch, fresh sessions).
Stage 3 — Decisive Synthesis (Claude, main thread)
Read all 10 score cards. Rank by score + project constraints
(journal bar / scope / comparability_impact risk).
Produce a single ranked short-list (top 2-3) with rationale.
If top candidate has UNVERIFIED compatibility claim, route to codex
for verification before committing.
codex exec resume).dual-agent-original-request-review raw-request preservation).If Stage 2 stalls (CLI upgrade, auth expiry, Claude Code restart, user interrupt), maintain a resume-state file. Working copy: /tmp/<dual_name>_state_<date>.md; mirror to packet dir (docs/methodology-notes/<dual_name>_packet_<date>/resume_state.md) on every update — /tmp is volatile, packet dir is durable source of truth.
State file contents:
model=<slug>, cli=<ver> from session jsonl (see § Model verification oracle), so cross-session resume is auditable.Write-through: update the state file immediately after each scorecard completes, before dispatching the next. Batched updates cause "I thought C4 was pending but it was already done" bugs.
Artifact archival: dual working artifacts are evidence, not scratch. /tmp is working space. On any of these boundaries, copy artifacts from /tmp into the packet dir and commit:
resume_state.md).git commit touching the packet — skeleton-only commits are insufficient if S1 divergent output or S2 scorecards exist.Mapping convention:
/tmp/codex_stage1_<topic>_output_<date>.md → S1_divergent_output.md/tmp/codex_stage2_scorecard_C<n>_<date>.md → S2_scorecard_C<n>.md/tmp/claude_pre_review_<ID>.md → 01_claude_pre_review.md/tmp/<dual_name>_state_<date>.md → resume_state.mdOn resume: don't re-run completed stages. The sealed pre-review and Stage 1 output remain authoritative across interruptions; re-dispatch only pending Stage-2 scorecards. Stage 3 synthesis reads all scorecards from the packet dir (not /tmp, which may be empty post-reboot).
artifact-grounded-review: Stage 2 cites file:line / artifact:key, missing-evidence scores marked UNVERIFIED.dual-agent-original-request-review Verification Discipline; Stage 3 outputs that change project direction (trainer / baseline / scoring) still go through dual-agent review for formal acceptance.## Divergent-Strict-Decisive Decision Log
Stage 1 candidates (codex session A, session-id=<...>):
1. <name> — <one-sentence>
2. ...
10. ...
Stage 2 score cards (parallel, 10 independent sessions):
| Candidate | novelty | feasibility | compat | complexity | gain | verdict | key evidence |
|---|---|---|---|---|---|---|---|
Stage 3 ranked short-list (Claude):
1. <top pick> — rationale + UNVERIFIED claims to resolve
2. <runner-up>
Missing Stage 2 evidence column or Stage 3 short-list rationale → process incomplete; Claude fills in before proceeding.
Use the TodoWrite tool when work has these properties:
ssh_tmux session).in_progress at start, completed immediately on finish — not batched.success_signal / artifact_output_paths.in_progress if blocked; new task describes blocker + resolution path.tools
Collect and audit Codex token usage with a bundled Python CLI and optional Windows batch launchers. Use this skill when the user asks to check Codex token usage, generate daily token audit logs, calculate monthly CostUSD totals, review Codex spending, or run Codex token usage scripts on Windows, Bash, WSL, or Linux.
tools
Use when giving the user an INLINE reply that carries a trade-off, a decision, a verdict, or a non-trivial finding (decision brief / round verdict / failure root-cause). NOT for "done"/status confirmations, one-line answers, or pure data dumps. Forces a compact decision-brief shape and blocks internal tool-name / file-path bleed into user-facing text.
development
Use for cross-file or cross-chapter terminology audits and corpus-wide term unification in thesis/paper sources — extract candidate term drift, build a decision queue, classify each occurrence, apply accepted replacements safely, and verify counts/build. Trigger on "术语审计", "术语统一", "术语一致性", "逐词审", "这个词全文怎么用", "把 X 全文改成 Y", "terminology audit", or "unify term X". Do NOT use for ordinary prose drafting or a single known-location edit; use academic-writing for prose quality and claim-boundary judgment.
tools
Use for ANY codex CLI dispatch via dispatch wrapper (no time threshold; presence-of-risk triggers, not estimated wall — stdin-EOF stalls occur at <60s). Combines internal log-inactivity watchdog wrapper + external Claude-session cron probe + sentinel hook enforcement. Detects stalls in ≤60-270s vs hours-without-detection failure mode.