skills/arifos-evals/SKILL.md
Run benchmark prompts, collect pass/fail traces, latency, token cost, and false activation rates for each skill. Load when a skill changes behavior or a new version is proposed.
npx skillsauth add ariffazil/openclaw-workspace arifos-evals
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
- skill-trigger-linter instead).
- arifos-recursive-audit instead).
- SKILL.md file proposed for evaluation.
- evals.json).

The evaluation flow is split into two explicit operational phases:
- <skill-name>-workspace/iteration-N/).
- timing.json).
- grading.json.
- benchmark.json and generate reports.

The output file benchmark.json must classify every execution using a standardized taxonomy:
- scenario_category: The high-level framework class (e.g. infrastructure_deployment, domain_petrophysics, governance_verification).
- context_length: Input size category by character or token count (short < 4K, medium 4K-16K, long > 16K).
- task_type: The reasoning dialect of the prompt (code_generation, ast_parsing, decision_reasoning, syntactic_lint).
- metrics: Nested performance counts:
{
"pass_rate": 0.0,
"latency_ms": 0,
"token_in": 0,
"token_out": 0,
"false_activation": false,
"rollback_triggered": false
}
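The taxonomy above can be sketched in Python as a small tagging helper. This is only an illustration under stated assumptions: `classify_context_length` and `build_record` are hypothetical names, "4K" and "16K" are read as 4,000 and 16,000 units (the text does not say whether it means 4,000 or 4,096, or tokens versus characters), and the sample values are made up.

```python
import json

# Assumption: "4K" and "16K" mean 4,000 and 16,000 units (tokens or characters).
SHORT_LIMIT = 4_000
LONG_LIMIT = 16_000

# Default metrics, mirroring the nested block shown above.
DEFAULT_METRICS = {
    "pass_rate": 0.0,
    "latency_ms": 0,
    "token_in": 0,
    "token_out": 0,
    "false_activation": False,
    "rollback_triggered": False,
}


def classify_context_length(size: int) -> str:
    """Map an input size onto the short / medium / long buckets."""
    if size < SHORT_LIMIT:
        return "short"
    if size <= LONG_LIMIT:
        return "medium"
    return "long"


def build_record(scenario_category: str, task_type: str,
                 input_size: int, metrics: dict) -> dict:
    """Assemble one taxonomy-tagged entry (hypothetical helper)."""
    unknown = set(metrics) - set(DEFAULT_METRICS)
    if unknown:
        raise ValueError(f"unknown metric keys: {sorted(unknown)}")
    return {
        "scenario_category": scenario_category,
        "context_length": classify_context_length(input_size),
        "task_type": task_type,
        "metrics": {**DEFAULT_METRICS, **metrics},
    }


# Illustrative values only; they are not results from a real run.
record = build_record(
    "governance_verification",
    "syntactic_lint",
    5_200,
    {"pass_rate": 1.0, "latency_ms": 840, "token_in": 5_200, "token_out": 310},
)
print(json.dumps(record, indent=2))
```

Rejecting unknown metric keys keeps every entry in the same shape, which makes later aggregation by scenario_category simpler.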
- scenario_category tags.
- timing.json immediately upon task completion.
- grading.json.
- benchmark.json.
- benchmark.json with standardized category tags is generated in the workspace.

{
"skill_name": "arifos-evals",
"version": "1.1.0",
"trigger_phrase": "{{trigger_phrase}}",
"selected_reason": "{{selected_reason}}",
"selected_branch": "iteration-{{N}}",
"latency_ms": 0,
"token_in": 0,
"token_out": 0,
"commands_run": 0,
"artifacts_written": 0,
"postcondition_pass": false,
"human_approval_required": false,
"hold_code": "{{hold_code}}"
}
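One way to fill the `{{...}}` placeholders in a template like the one above is a small substitution pass followed by a JSON parse. The sketch below is not the skill's actual renderer: `render_template` is a hypothetical name, the template is shortened, and the sample values (including the `"NONE"` hold code) are invented for illustration.

```python
import json
import re

# Shortened copy of the template above; only a few fields are kept.
TEMPLATE = """{
  "skill_name": "arifos-evals",
  "trigger_phrase": "{{trigger_phrase}}",
  "selected_branch": "iteration-{{N}}",
  "postcondition_pass": false,
  "hold_code": "{{hold_code}}"
}"""

PLACEHOLDER = re.compile(r"\{\{(\w+)\}\}")


def render_template(template: str, values: dict) -> dict:
    """Replace each {{name}} with its value and parse the result as JSON."""

    def substitute(match: re.Match) -> str:
        key = match.group(1)
        if key not in values:
            raise KeyError(f"no value supplied for placeholder {key!r}")
        # json.dumps escapes quotes and backslashes; [1:-1] drops the
        # surrounding quotes because the template already provides them.
        return json.dumps(str(values[key]))[1:-1]

    return json.loads(PLACEHOLDER.sub(substitute, template))


# Illustrative values only.
filled = render_template(
    TEMPLATE,
    {"trigger_phrase": 'run "evals" now', "N": 3, "hold_code": "NONE"},
)
print(filled["selected_branch"])
```

Parsing the rendered text with `json.loads` means a missing or badly escaped value fails at render time instead of producing a malformed record.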
development
Governed intelligence skill for AAA as the abstraction, attestation, and abduction control plane across arifOS, APEX, A-FORGE, GEOX, WEALTH, WELL, and the ariffazil profile repository. Use when the user asks to explain or design AAA, route agentic work, reduce chaos/entropy in an arifOS federation task, create AREP/task declarations, classify risk, plan multi-repo changes, review governance boundaries, or translate human intent into evidence-backed, authority-safe, recursively agentic workflows. Provides deterministic F1-F13 floor checking, bounded abduction, and FederationReceipt composition.
development
Check every skill’s “use when” and “do not use when” clauses for collisions, missing negatives, and vague verbs like “help,” “assist,” or “improve.” Load when linting, reviewing, or validating trigger boundaries.
development
Bootstrap, design, and package new skills. Load when capturing user intent for a new skill or drafting its initial instruction framework.
content-media
Diagnose which federation services are up, down, or drifting. Produce a prioritized remediation plan.