Adoption

Agent Skills are supported by leading AI development tools.

VS Code, Gemini CLI, GitHub, Goose, Amp, Cursor, Claude Code, Letta, OpenCode, Claude, OpenAI Codex, Factory

ariffazil/arifOS Evals

Name: arifOS Evals
Author: ariffazil

skills/arifos-evals/SKILL.md

npx skillsauth add ariffazil/openclaw-workspace arifOS Evals

Run benchmark prompts, collect pass/fail traces, latency, token cost, and false activation rates for each skill. Load when a skill changes behavior or a new version is proposed.

2 stars

data-ai

Updated Jul 10, 2026

npx skillsauth add ariffazil/openclaw-workspace arifOS Evals

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

| Status | Scanner | Description | Percentage shown |
|---|---|---|---|
| Clean | Trivy | Container and dependency vulnerability scanner | 95% |
| Clean | Semgrep | Static code analysis for vulnerabilities | 95% |
| Clean | mcp-scan (Snyk) | Model Context Protocol security validation | 95% |
| Skipped | Snyk (dep) | Open source security scanning | 50% |
| Skipped | Socket.dev | Supply chain security analysis | 50% |
| Skipped | VirusTotal | Multi-engine malware detection | 50% |
| Skipped | CrowdStrike | Advanced threat intelligence | 50% |
| Skipped | OSV-Scanner | Open Source Vulnerability database check | 50% |
| Skipped | OWASP Dep-Check | | 50% |

Last scanned: Jul 10, 2026, 4:07 AM · 135.3s · 5 files scanned

SKILL.md

id:: arifos-evals
name:: arifOS Evals
version:: 1.0.0
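description:: Run benchmark prompts, collect pass/fail traces, latency, token cost, and false activation rates for each skill. Load when a skill changes behavior or a new version is proposed.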
owner:: AAA
risk_tier:: low
language:: true
math:: true
physics:: false
servers:: []
tools:: []
schema_version:: 1
artifact_hash:: pending
layer:: HEXAGON
autonomy_tier:: T2

arifos-evals (O_Ω Constitutional Layer)

arifOS-ACT Embedding

Before using this skill on any mutating, irreversible, or high-blast-radius task:

ART — Attune (what is the real task?), Recognize (what class of power?), Test (fit · authority · evidence · blast · reversible).
Kernel — Route to arifOS for F1–F13 judgment if action class is Maker/Messenger/Mutator/Destroyer/Sovereign.
ACT — Apply narrow, Constrain scope, Trace witness, STOP before corruption.
Receipt — Leave evidence of what changed, why, and under whose authority.

Purpose

Run benchmark prompts, collect pass/fail traces, latency, token cost, and false activation rates for each skill.

Use When

Evaluating a newly proposed skill draft against a baseline (a run without the skill).
Benchmarking the performance delta of a modified skill against its original version.
Conducting quantitative checks on response times, token efficiency, and execution costs.
Running automated trigger evaluation queries to calculate precision and recall.
Optimizing a skill's description using programmatic feedback loops.
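
The precision and recall mentioned above can be computed from labelled trigger queries. The following is a minimal sketch, assuming each query carries a hypothetical `should_trigger` label and an observed `did_trigger` flag; neither name comes from the skill itself.

```python
from dataclasses import dataclass


@dataclass
class TriggerResult:
    should_trigger: bool  # hypothetical label: the query is meant to activate the skill
    did_trigger: bool     # observed: the skill actually activated


def precision_recall(results):
    """Return (precision, recall) for a list of TriggerResult records."""
    tp = sum(r.should_trigger and r.did_trigger for r in results)
    fp = sum((not r.should_trigger) and r.did_trigger for r in results)
    fn = sum(r.should_trigger and (not r.did_trigger) for r in results)
    # Report 0.0 instead of dividing by zero when a denominator is empty.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall
```

A false activation in this framing is a false positive: the skill fired on a query that was not meant to trigger it.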

Do Not Use When

Linting triggers for vague verbs or formatting collisions (use skill-trigger-linter instead).
Performing generic codebase audits or link validation (use arifos-recursive-audit instead).
The task requires structural design changes to a skill's logic.

Inputs

Skill Draft: The SKILL.md file proposed for evaluation.
Test Config: A JSON file containing benchmark prompts and expected outputs (e.g. evals.json).
Baseline Snapshot: The original skill version, or an empty-state folder when no earlier version exists.

Operational Lifecycle Phases

The evaluation flow is split into two explicit operational phases:

Phase 1: The Design Phase

Intent: Establish the parameters, axes, and contexts of the test suite.
Actions:
- Define evaluation axes (precision, latency, token drift, rollback safety).
- Assemble a curated, diverse set of test queries.
- Set up baseline and variant configuration definitions.
- Initialize the metrics target directory (<skill-name>-workspace/iteration-N/).
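
Initializing the metrics target directory might look like the sketch below. The function name and the `root` parameter are illustrative assumptions; only the `<skill-name>-workspace/iteration-N/` layout comes from the text above.

```python
from pathlib import Path


def init_iteration_dir(root, skill_name, iteration):
    """Create <root>/<skill-name>-workspace/iteration-N/ and return its path."""
    target = Path(root) / f"{skill_name}-workspace" / f"iteration-{iteration}"
    # exist_ok=False is a deliberate choice in this sketch: it refuses to
    # reuse a directory that may hold results from an earlier run.
    target.mkdir(parents=True, exist_ok=False)
    return target
```

Failing on an existing directory keeps each iteration's artifacts separate; a caller that wants to resume a run would need a different policy.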

Phase 2: The Execution Phase

Intent: Trigger the evaluations, record telemetry, programmatically grade the outputs, and generate aggregated benchmarks.
Actions:
- Spawn parallel execution subagent tasks.
- Measure and capture timing logs (timing.json).
- Verify assertions and write the results to grading.json.
- Aggregate metrics into benchmark.json and generate reports.
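
The first two execution actions can be sketched as follows. This is an illustration only: `run_prompt` stands in for whatever actually invokes a subagent, a thread pool stands in for subagent spawning, and the record layout written to timing.json is an assumption.

```python
import json
import time
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def timed_run(run_prompt, prompt):
    """Run one prompt and measure wall-clock latency in milliseconds."""
    start = time.perf_counter()
    output = run_prompt(prompt)
    latency_ms = int((time.perf_counter() - start) * 1000)
    return {"prompt": prompt, "output": output, "latency_ms": latency_ms}


def run_suite(run_prompt, prompts, out_dir, max_workers=4):
    """Run all prompts in parallel, then write timing.json into out_dir."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves the order of the input prompts.
        records = list(pool.map(lambda p: timed_run(run_prompt, p), prompts))
    timing = [{"prompt": r["prompt"], "latency_ms": r["latency_ms"]} for r in records]
    (Path(out_dir) / "timing.json").write_text(json.dumps(timing, indent=2))
    return records
```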

DevBench-Aligned Metrics Schema

The output file benchmark.json must classify every execution using a standardized taxonomy:

scenario_category: The high-level framework class (e.g. infrastructure_deployment, domain_petrophysics, governance_verification).
context_length: Input character/token weight category (short < 4K, medium 4K-16K, long > 16K).
task_type: The reasoning dialect of the prompt (code_generation, ast_parsing, decision_reasoning, syntactic_lint).

metrics: Nested performance counts:

{
  "pass_rate": 0.0,
  "latency_ms": 0,
  "token_in": 0,
  "token_out": 0,
  "false_activation": false,
  "rollback_triggered": false
}
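
One way to assemble a benchmark.json entry from the fields above is sketched here. Several points are assumptions rather than part of the schema: sizes are counted with K = 1000, the per-run input keys (`passed` and so on) are hypothetical, latency is averaged, token counts are summed, and the boolean flags are true if any run set them.

```python
from statistics import mean


def context_length_category(size, k=1000):
    """Map an input size to short (< 4K), medium (4K-16K), or long (> 16K)."""
    if size < 4 * k:
        return "short"
    if size <= 16 * k:
        return "medium"
    return "long"


def aggregate_metrics(runs):
    """Collapse per-run records into one metrics block (assumed aggregation rules)."""
    if not runs:
        raise ValueError("cannot aggregate an empty run list")
    return {
        "pass_rate": sum(r["passed"] for r in runs) / len(runs),
        "latency_ms": round(mean(r["latency_ms"] for r in runs)),
        "token_in": sum(r["token_in"] for r in runs),
        "token_out": sum(r["token_out"] for r in runs),
        "false_activation": any(r["false_activation"] for r in runs),
        "rollback_triggered": any(r["rollback_triggered"] for r in runs),
    }


def benchmark_entry(scenario_category, task_type, input_size, runs):
    """Build one classified entry with the four top-level schema fields."""
    return {
        "scenario_category": scenario_category,
        "context_length": context_length_category(input_size),
        "task_type": task_type,
        "metrics": aggregate_metrics(runs),
    }
```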

Procedure

Phase 1 (Design): Establish test configs, select baseline variant, and define standard scenario_category tags.
Phase 2 (Execution): Trigger the parallel run suite under iteration directories.
Timing Capture: Record timing.json immediately upon task completion.
Assertion Grading: Validate outputs programmatically against expected invariants and write to grading.json.
Benchmark Compilation: Aggregate results using the DevBench-Aligned Metrics Schema and output to benchmark.json.
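
The assertion grading step could be sketched as below. The skill does not define what an "expected invariant" looks like, so this example assumes the simplest possible form, a list of substrings that must appear in the output; the function names are illustrative.

```python
import json
from pathlib import Path


def grade_output(output, expected_substrings):
    """Check each expected substring and return a per-check breakdown."""
    checks = [{"expected": s, "passed": s in output} for s in expected_substrings]
    # Note: with no expectations, all() is True and the run is graded as a pass.
    return {"passed": all(c["passed"] for c in checks), "checks": checks}


def write_grading(out_dir, grades):
    """Persist the graded results as grading.json in the iteration directory."""
    path = Path(out_dir) / "grading.json"
    path.write_text(json.dumps(grades, indent=2))
    return path
```

Keeping the per-check breakdown alongside the overall verdict makes it easier to see which assertion failed when a grade is later disputed.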

Postconditions

A valid benchmark.json with standardized category tags is generated in the workspace.
Pass/fail comparisons are programmatically graded and saved.
Evaluation results do not mix lab-synthetic data with online field telemetry.

Failure Modes & Escalation

Execution Timeout: Parallel subagents hang or fail to return timing data. Action: Terminate execution, record a fail grade, and list the step limit as exceeded.
Grader Divergence: Quantitative grades differ from manual human feedback. Action: Flag the assertions as ambiguous and request manual grading override.
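
The Execution Timeout action can be approximated as in the sketch below, assuming a subagent task is represented by a plain callable. The record keys (`grade`, `step_limit_exceeded`, `elapsed_ms`) are hypothetical. A Python thread cannot be forcibly killed, so this only stops waiting and records the failure; real termination would need a subprocess or a cancellable subagent.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def run_with_timeout(task, timeout_s):
    """Run task; on timeout record a fail grade and flag the limit as exceeded."""
    pool = ThreadPoolExecutor(max_workers=1)
    start = time.perf_counter()
    future = pool.submit(task)
    try:
        output = future.result(timeout=timeout_s)
        # grade stays None here: a completed run still has to be graded.
        record = {"output": output, "grade": None, "step_limit_exceeded": False}
    except FutureTimeout:
        record = {"output": None, "grade": "fail", "step_limit_exceeded": True}
    finally:
        # Do not block on a hung worker (cancel_futures needs Python 3.9+).
        pool.shutdown(wait=False, cancel_futures=True)
    record["elapsed_ms"] = int((time.perf_counter() - start) * 1000)
    return record
```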

Telemetry per Run

{
  "skill_name": "arifos-evals",
  "version": "1.1.0",
  "trigger_phrase": "{{trigger_phrase}}",
  "selected_reason": "{{selected_reason}}",
  "selected_branch": "iteration-{{N}}",
  "latency_ms": 0,
  "token_in": 0,
  "token_out": 0,
  "commands_run": 0,
  "artifacts_written": 0,
  "postcondition_pass": false,
  "human_approval_required": false,
  "hold_code": "{{hold_code}}"
}
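
The `{{...}}` placeholders in the record above suggest a template that is filled once per run. A minimal sketch of that substitution follows; the function name is an assumption, and numeric fields such as latency_ms are left for the caller to overwrite.

```python
import re

# Matches placeholders of the form {{name}}.
PLACEHOLDER = re.compile(r"\{\{(\w+)\}\}")


def fill_template(template, values):
    """Replace {{name}} placeholders in string fields; leave other fields as-is."""
    def fill(value):
        if isinstance(value, str):
            # A missing name raises KeyError, so gaps surface immediately.
            return PLACEHOLDER.sub(lambda m: str(values[m.group(1)]), value)
        return value

    return {key: fill(value) for key, value in template.items()}
```

For example, `fill_template({"selected_branch": "iteration-{{N}}"}, {"N": 3})` yields `{"selected_branch": "iteration-3"}`.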

Recursive Scorecard

Activation Precision: [0.0 - 1.0] (Target: >0.95)
Task Completion Rate: [0.0 - 1.0] (Target: >0.98)
Rollback Safety: [0.0 - 1.0] (Target: 1.00)
Context Efficiency: [0.0 - 1.0] (Target: >0.90)
Doc Freshness: [0.0 - 1.0] (Target: 1.00)
Cross-Skill Collision Rate: [0.0 - 1.0] (Target: 0.00)
Human Trust Score: [0.0 - 1.0] (Target: >0.98)
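
Checking a scorecard against these targets can be sketched as follows. The sketch assumes that a target written as a bare number (1.00 or 0.00) means exact equality and that ">" means strictly greater; the function name is illustrative.

```python
import operator

# Metric name -> (comparison, target), transcribed from the scorecard above.
TARGETS = {
    "Activation Precision": (operator.gt, 0.95),
    "Task Completion Rate": (operator.gt, 0.98),
    "Rollback Safety": (operator.eq, 1.00),
    "Context Efficiency": (operator.gt, 0.90),
    "Doc Freshness": (operator.eq, 1.00),
    "Cross-Skill Collision Rate": (operator.eq, 0.00),
    "Human Trust Score": (operator.gt, 0.98),
}


def unmet_targets(scores):
    """Return the names of metrics whose score does not satisfy its target."""
    return [
        name
        for name, (compare, target) in TARGETS.items()
        if not compare(scores[name], target)
    ]
```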

Related Skills

ariffazil/XAUUSD-trading-stack
development · Verified · Trusted · Community
Federation-wide gold (XAUUSD) trading capability. Python stack, OANDA broker, backtesting, macro signals, RSI strategy. Every organ has a role.
2 · SKILL.md · Updated Jul 24, 2026

ariffazil/wealth-claim-state
development · Verified · Trusted · Community
Capital claim state management: tracks claim lifecycle across WEALTH organ.
2 · SKILL.md · Updated Jul 24, 2026

ariffazil/warga-constitutional
development · Verified · Trusted · Community
Archived constitutional warga placeholder retained only for audit provenance. Do not use for active work; use the live arifOS governance and constitutional skills instead.
2 · SKILL.md · Updated Jul 24, 2026

ariffazil/warga
testing · Verified · Trusted · Community
Warga (citizen) agent skills for AAA federation members. See subdirectories for specialized warga skills.
2 · SKILL.md · Updated Jul 24, 2026

Install manually

# Clone the repo
git clone https://github.com/ariffazil/openclaw-workspace.git

# Create the global Claude Code skills folder if it does not exist yet
mkdir -p ~/.claude/skills

# Copy into Claude Code skills folder (global)
cp -r openclaw-workspace/skills/arifos-evals ~/.claude/skills/

Claude Code Skills — official skills path docs.

Repository

ariffazil/openclaw-workspace

2 stars

Compatible with

Claude Code

OpenAI Codex CLI

ChatGPT