skills/meta-optimize/SKILL.md
Analyze ARIS usage logs and propose optimizations to SKILL.md files, reviewer prompts, and workflow defaults. Outer-loop harness optimization inspired by Meta-Harness (Lee et al., 2026). Use when user says "优化技能", "meta optimize", "improve skills", "分析使用记录", or wants to optimize ARIS's own harness components based on accumulated experience.
npx skillsauth add wanshuiyin/Auto-claude-code-research-in-sleep meta-optimizeInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Analyze accumulated usage logs and propose optimizations for: $ARGUMENTS
meta-optimize proposes; it does not land. The mutation of the skill corpus
is the exclusive job of a separate, human-invoked skill: /meta-apply.
This split is structural, not advisory — it is why a missed instruction cannot let
this loop apply its own patch (the self-acquittal failure mode):
Write/Edit tool. This skill cannot edit a SKILL.md / shared-reference /
any corpus file with the frictionless mutators. Its only outputs are the REPORT and
staged patch files, written under .aris/meta/ (a scratch area, never the corpus)./meta-apply; a human must then
invoke /meta-apply to land them. That human action is the landing gate.Bash writes to the corpus are filtered, not impossible — be honest about the
layers. What IS fully closed: the accidental / in-flow self-acquittal — this skill
has no Write/Edit and no apply step, so an honest run cannot slip into editing the
corpus. Defense-in-depth: install the
corpus_write_guard PreToolUse
hook (like meta_logging.json), which DENIES the common Bash shell-writes (>, tee,
sed -i, cp/mv, touch, open(...,'w')) to corpus paths. This is a blacklist,
NOT a complete sandbox — a deliberately obscured Bash write (git apply, patch,
$var/absolute paths, language file APIs) is not all caught. Full structural
prevention requires either removing this skill's Bash or an FS sandbox — over-built
for a not-yet-load-bearing producer, so deferred to when the gate carries real
auto-modification volume (a brick-3 trigger). The intended backstop against a deliberate
write is detection, not prevention — a corpus change with no valid/current
provenance stamp (content-hash mismatch) would be catchable in a pre-push integrity
check — but that verifier is NOT yet built (provenance.py has content_hash but no
integrity-check subcommand, and no pre-push hook runs one). So today the deliberate-write
case is neither prevented nor actively detected; track the integrity verifier as a
follow-up before this producer goes load-bearing. Its legitimate Bash writes go only to
.aris/meta/.See shared-references/acceptance-gate.md:
a loop can DRIVE (propose, review) same-model, but the ACQUITTAL that lands a change
must be cross-model (Step 4 jury) and the landing must be a separate human-gated
act (/meta-apply).
ARIS is a research harness — a system of skills, bridges, workflows, and artifact contracts that wraps around LLMs to orchestrate research. This skill implements a prototype outer loop that observes how the harness is used and proposes improvements to the harness itself (not to the research artifacts it produces).
Inspired by Meta-Harness (Lee et al., 2026): the key insight is that harness design matters as much as model weights, and harness engineering can be partially automated by logging execution traces and using them to guide improvements.
| Component | Example | Optimizable? |
|-----------|---------|:---:|
| SKILL.md prompts | Reviewer instructions, quality gates, step descriptions | Yes |
| Default parameters | difficulty: medium, MAX_ROUNDS: 4, threshold: 6/10 | Yes |
| Convergence rules | When to stop the review loop, retry counts | Yes |
| Workflow ordering | Skill chain sequence within a workflow | Yes |
| Artifact schemas | What fields go in EXPERIMENT_LOG.md, idea-stage/IDEA_REPORT.md | Cautious |
| MCP bridge config | Which reviewer model, routing rules | No (infra) |
Not optimized: The research artifacts themselves (papers, code, experiments). That's what the regular workflows do.
templates/claude-hooks/meta_logging.json into your project's .claude/settings.json (or merge the hooks section)..aris/meta/events.jsonl. The skill will check and warn if insufficient.EVENTS_FILE=".aris/meta/events.jsonl"
if [ ! -f "$EVENTS_FILE" ]; then
echo "ERROR: No event log found at $EVENTS_FILE"
echo "Enable logging first: copy templates/claude-hooks/meta_logging.json into .claude/settings.json"
exit 1
fi
EVENT_COUNT=$(wc -l < "$EVENTS_FILE")
SKILL_INVOCATIONS=$(grep -c '"skill_invoke"' "$EVENTS_FILE" || echo 0)
SESSIONS=$(grep -c '"session_start"' "$EVENTS_FILE" || echo 0)
echo "📊 Event log: $EVENT_COUNT events, $SKILL_INVOCATIONS skill invocations, $SESSIONS sessions"
if [ "$SKILL_INVOCATIONS" -lt 5 ]; then
echo "⚠️ Insufficient data (<5 skill invocations). Continue using ARIS normally and re-run later."
exit 0
fi
Read .aris/meta/events.jsonl and compute:
Frequency analysis:
Failure analysis:
Convergence analysis (for auto-review-loop):
Human intervention analysis:
Present findings as a structured summary table.
Based on Step 1, rank optimization opportunities by expected impact:
## Optimization Opportunities (ranked)
| # | Target | Signal | Proposed Change | Expected Impact |
|---|--------|--------|-----------------|-----------------|
| 1 | auto-review-loop default threshold | Users override to 7/10 in 60% of runs | Change default from 6/10 to 7/10 | Fewer manual overrides |
| 2 | experiment-bridge retry count | 40% of runs hit max retries on OOM | Add OOM-specific recovery (reduce batch size) | Fewer failed experiments |
| 3 | paper-write de-AI patterns | Users manually fix "delve" in 80% of runs | Add "delve" to default watchword list | Fewer manual edits |
If $ARGUMENTS specifies a target skill, focus analysis on that skill only.
If $ARGUMENTS is empty or "all", analyze all skills with sufficient data.
For each optimization target, generate a concrete diff:
--- a/skills/auto-review-loop/SKILL.md
+++ b/skills/auto-review-loop/SKILL.md
@@ -15,7 +15,7 @@
## Constants
-- **SCORE_THRESHOLD = 6** — Minimum review score to accept.
+- **SCORE_THRESHOLD = 7** — Minimum review score to accept. (Raised based on usage data: 60% of users overrode to 7+.)
Rules for patch generation:
shared-references/capture-antipatterns.md):
run a proposed patch's rationale through tools/capture_filter.py (resolve via
the canonical chain). NEVER propose a change that encodes a negative
tool-capability claim ("codex can't…", "gemini is broken") or a one-off /
transient failure as a durable rule — those harden into self-cited refusals.
Encode the fix / the flag needed / the workaround, not "X can't do Y".This review is advisory — it sharpens the Step-5 REPORT so the human can decide what to stage. It is not the landing verdict. The binding cross-model jury runs later, at landing, inside
/meta-apply, on the actual staged diff (a producer-relayed verdict would be forgeable). Record this result asadvisory_screenonly.
Send each patch to GPT-5.5 xhigh for adversarial review:
mcp__codex__codex:
model: gpt-5.5
config: {"model_reasoning_effort": "xhigh"}
prompt: |
You are reviewing a proposed optimization to an ARIS SKILL.md file.
## Original Skill (relevant section)
[paste original]
## Proposed Patch
[paste diff]
## Evidence from Usage Log
[paste summary stats]
Review this patch:
1. Does the evidence support the change?
2. Could this change hurt other use cases?
3. Is the change minimal and safe?
4. Score 1-10: should this be applied?
If score < 7, explain what additional evidence would be needed.
Output a structured report:
# ARIS Meta-Optimization Report
**Date**: [today]
**Data**: [N] events, [M] skill invocations, [K] sessions
**Target**: [skill name or "all"]
## Proposed Changes
### Change 1: [title]
- **Target**: [skill/file:line]
- **Signal**: [what the data shows]
- **Patch**: [diff]
- **Reviewer Score**: [X/10]
- **Reviewer Notes**: [summary]
- **Status**: ✅ Recommended / ⚠️ Needs more data / ❌ Rejected
### Change 2: ...
## Changes NOT Made (insufficient evidence)
- [pattern observed but too few samples]
## Recommendations
- [ ] Apply Change 1 (reviewer approved)
- [ ] Collect more data for Change 3 (need N more runs)
- [ ] Consider manual review of Change 2
## Next Steps
This skill only **proposes**. To land changes: tell me which to stage, then run
`/meta-apply` (a separate, human-invoked applier that re-checks the cross-model
verdict before mutating anything). meta-optimize never applies.
/meta-apply (NO in-skill apply)This skill does not apply anything. After the user has read the Step-5 REPORT and indicated which changes to land, stage them for the privileged applier:
N, write its unified diff to
.aris/meta/pending/<NN>_<skill>.diff and append a row to
.aris/meta/pending/manifest.jsonl:
{patch: "<NN>_<skill>.diff", target: "<corpus path>", author_model: "<executor>", advisory_screen: "pass|kill", advisory_reason: "<one line>"}.
The advisory_screen (your Step-4 codex pre-review) is advisory only — it helps
the human read the REPORT. It is NOT the landing verdict and /meta-apply does not
trust it: a producer-written verdict would be forgeable. The binding cross-model jury
runs at landing, inside /meta-apply, on the actual staged diff./meta-apply to judge & land them."The backup → fresh jury-at-landing → apply → provenance stamp → log all happen
inside /meta-apply. meta-optimize never touches the corpus and
never produces the acquittal.
Never apply in this skill. Landing is /meta-apply + a fresh jury + a human, always.
The log at .aris/meta/events.jsonl contains JSONL records with these shapes:
{"ts":"...","session":"...","event":"skill_invoke","skill":"auto-review-loop","args":"difficulty: hard"}
{"ts":"...","session":"...","event":"PostToolUse","tool":"Bash","input_summary":"pdflatex main.tex"}
{"ts":"...","session":"...","event":"codex_call","tool":"mcp__codex__codex","input_summary":"review..."}
{"ts":"...","session":"...","event":"tool_failure","tool":"Bash","input_summary":"python train.py"}
{"ts":"...","session":"...","event":"slash_command","command":"/auto-review-loop","args":""}
{"ts":"...","session":"...","event":"user_prompt","prompt_preview":"change difficulty to hard"}
{"ts":"...","session":"...","event":"session_start","source":"startup","model":"claude-opus-4-6"}
{"ts":"...","session":"...","event":"session_end"}
This skill is NOT part of the standard W1→W1.5→W2→W3→W4 pipeline. It is a maintenance workflow with three trigger mechanisms:
Passive logging (always on): Claude Code hooks record events to .aris/meta/events.jsonl automatically during normal usage. Zero user effort.
Automatic readiness check (SessionEnd hook): When a Claude Code session ends, check_ready.sh counts skill invocations since the last /meta-optimize run. If ≥5 new invocations have accumulated, it prints a reminder:
📊 ARIS has logged 8 skill runs since last optimization. Run /meta-optimize to check for improvement opportunities.
This is a suggestion only — it does not auto-run optimization.
Manual trigger: User runs /meta-optimize when they see the reminder or whenever they want.
After each /meta-optimize run, the skill writes the current timestamp to .aris/meta/.last_optimize so the readiness check only counts new invocations.
Inspired by Meta-Harness (Lee et al., 2026) — end-to-end optimization of model harnesses via filesystem-based experience access and agentic code search.
Follow these shared protocols for all output files:
- Output Versioning Protocol — write timestamped file first, then copy to fixed name
- Output Manifest Protocol — log every output to MANIFEST.md
- Output Language Protocol — respect the project's language setting
After each mcp__codex__codex or mcp__codex__codex-reply reviewer call, save the trace following shared-references/review-tracing.md (Policy C — forensic; never silently skip). Use save_trace.sh (resolved per the chain in shared-references/integration-contract.md §2) or write files directly to .aris/traces/<skill>/<date>_run<NN>/. Respect the --- trace: parameter (default: full).
data-ai
Generate and rank research ideas given a broad direction. Use when user says "找idea", "brainstorm ideas", "generate research ideas", "what can we work on", or wants to explore a research area for publishable directions.
development
Get a deep critical review of research from GPT using a secondary Codex agent. Use when user says "review my research", "help me review", "get external review", or wants critical feedback on research ideas, papers, or experimental results.
data-ai
Generate and rank research ideas given a broad direction. Use when user says "找idea", "brainstorm ideas", "generate research ideas", "what can we work on", or wants to explore a research area for publishable directions.
development
Autonomous multi-round research review loop. Repeatedly reviews using a secondary Codex agent, implements fixes, and re-reviews until positive assessment or max rounds reached. Use when user says "auto review loop", "review until it passes", or wants autonomous iterative improvement.