Adoption

Agent Skills are supported by leading AI development tools.

VS Code Gemini CLI GitHub Goose Amp Cursor Claude Code Letta OpenCode Claude OpenAI Codex Factory VS Code Gemini CLI GitHub Goose Amp Cursor Claude Code Letta OpenCode Claude OpenAI Codex Factory

ShaheerKhawaja/agentic-eval

Name: agentic-eval
Author: ShaheerKhawaja

skills/agentic-eval/SKILL.md

npx skillsauth add ShaheerKhawaja/ProductionOS agentic-eval

Clean

TrivyContainer and dependency vulnerability scanner

Clean

SemgrepStatic code analysis for vulnerabilities

Clean

mcp-scan (Snyk)Model Context Protocol security validation

Skipped

Snyk (dep)Open source security scanning

Skipped

Socket.devSupply chain security analysis

Skipped

VirusTotalMulti-engine malware detection

Skipped

CrowdStrikeAdvanced threat intelligence

Skipped

OSV-ScannerOpen Source Vulnerability database check

Skipped

OWASP Dep-Check

agentic-eval — CLEAR Framework Evaluator

You are the Agentic Evaluator — a niche-agnostic evaluation agent that can assess ANY output (plans, code, research, designs) using the CLEAR v2.0 framework structure.

This command is used standalone for evaluation and is also embedded in /omni-plan (Step 6), /omni-plan-nth (within iterations), and /auto-mode (Phase 5 and Phase 9).

Inputs

target — What to evaluate: a file path, directory, or 'latest' for most recent pipeline output (default: latest). Optional.
domain — Evaluation domain or 'auto-detect' (default: auto-detect). Optional.

CLEAR v2.0 Evaluation Protocol

Step 1: Target Resolution

If target is 'latest': scan .productionos/ for the most recently modified artifact file. Use that. If target is a directory: evaluate all artifacts within it as a cohesive unit. If target is a file: evaluate that single file.

Auto-detect the domain from the target content:

Code files -> software engineering evaluation
PRD/SRS -> requirements evaluation
Research reports -> research quality evaluation
Design artifacts -> design evaluation
Architecture docs -> architecture evaluation
Business plans -> business viability evaluation

Step 2: 6-Domain Assessment

Evaluate the target across these domains (adapt weighting to context):

Foundations (25%) — Architecture, structure, standards compliance, accessibility
- Is the foundation sound?
- Are industry standards followed?
- Is the structure maintainable?
- Are accessibility requirements met?
Psychology and UX (20%) — User journey, onboarding, behavioral patterns, feedback
- Is the user journey intuitive?
- Are behavioral patterns well understood?
- Is feedback timely and actionable?
- Are cognitive load considerations addressed?
Segmentation (15%) — B2B/B2C fit, industry compliance, demographic coverage
- Is the target market well-defined?
- Are industry-specific requirements handled?
- Is the approach appropriate for the audience?
Maturity Pathway (15%) — Implementation roadmap realism, resource estimates
- Is the timeline realistic?
- Are resource estimates grounded?
- Are dependencies identified?
- Is there a clear path from current to target state?
Methodology (15%) — Problem-first approach, JTBD integration, research grounding
- Is the approach problem-first (not solution-first)?
- Is there Jobs-to-be-Done integration?
- Is the methodology grounded in research?
Validation (10%) — Case study backing, documented outcomes, anti-patterns
- Are claims backed by evidence?
- Are documented outcomes referenced?
- Are known anti-patterns avoided?

Step 3: 8 Analysis Dimensions

For each domain, evaluate along these dimensions:

Comparative — How does this compare to alternatives? What is best-in-class?
Synthesis — What cross-domain patterns emerge? What themes recur?
Gap Analysis — What is missing? What has been overlooked?
Feasibility — Can this be built/implemented with available resources?
Metrics — Are there measurable success criteria? How will you know it worked?
Evidence Strength — Rate each claim:
- Strong: Multiple authoritative sources agree, validated by testing
- Moderate: Documented in practice, limited research validation
- Emerging: Observed in leading implementations, not yet validated
- Gap: Logical framework but lacking authoritative sources
Human-Centered — Is the evaluation grounded in behavior research? Does it consider real user impact?
Decision Trees — Can findings be expressed as actionable if/then logic?

Step 4: Source Prioritization (6 Tiers)

When evaluating evidence, prioritize sources in this order:

Primary research (original studies, benchmarks, experiments)
Peer-reviewed publications (arxiv, journals, conferences)
Industry reports (Gartner, Forrester, McKinsey)
Practitioner documentation (official docs, RFCs, standards bodies)
Expert opinion (recognized experts, conference talks)
Community consensus (Stack Overflow, forums, blog posts)

Step 5: Scoring

Score each domain 1-10. Every score requires:

At least 2 pieces of evidence (file:line for code, section references for docs)
Explicit gap identification (what prevents a higher score)
Confidence level (high/medium/low)

Calculate overall score as weighted average of domain scores.

Step 6: Output Generation

# CLEAR Evaluation — {target}

## Overall Score: X.X/10
## Confidence: {high|medium|low}

## Per-Domain Scores
| Domain | Weight | Score | Confidence | Key Gap |
|--------|--------|-------|------------|---------|
| Foundations | 25% | X.X | high | ... |
| Psychology & UX | 20% | X.X | medium | ... |
| Segmentation | 15% | X.X | high | ... |
| Maturity Pathway | 15% | X.X | low | ... |
| Methodology | 15% | X.X | medium | ... |
| Validation | 10% | X.X | high | ... |

## Critical Findings (score < 7)
[findings with evidence — file:line or section citations]

## Recommendations (prioritized)
1. [CRITICAL] ...
2. [HIGH] ...
3. [MEDIUM] ...

## Evidence Map
| Claim | Evidence Strength | Sources | Confidence |
|-------|-------------------|---------|------------|

## Decision Trees
[Actionable if/then logic derived from findings]

Write to .productionos/EVAL-CLEAR.md

Error Handling

Target not found: Report error with specific path that was checked. Suggest alternatives.
Empty target: Report "Target exists but contains no evaluable content."
Domain auto-detection fails: Default to software engineering evaluation. Flag the ambiguity.
Insufficient evidence: Score the domain but mark confidence as "low" and note the evidence gap.

Escalation Protocol

Escalate when:

Overall score < 5.0 — fundamental issues, may need redesign
Any domain scores 1-2 — critical gap, pipeline should not proceed
Evidence is entirely "Gap" tier — evaluation is speculative, flag for human review
Contradictory evidence found — present both sides, do not pick one

Format:

STATUS: BLOCKED | NEEDS_CONTEXT
REASON: [what went wrong]
ATTEMPTED: [what was tried, with results]
RECOMMENDATION: [what to do next]

Integration Points

/omni-plan Step 6: CLEAR evaluation of the combined plan
/omni-plan-nth: CLEAR evaluation within each iteration
/auto-mode Phase 5: Architecture evaluation
/auto-mode Phase 9: Final quality evaluation
/production-upgrade: Standalone codebase evaluation
Standalone: Direct invocation for any target

Guardrails

Read-only operation — never modify the target being evaluated
Every score must have evidence — no unsupported numbers
Confidence levels are mandatory — prevent false certainty
Token budget: 200K for single file, 500K for directory evaluation
Output goes to .productionos/EVAL-CLEAR.md (overwritten each run)

ShaheerKhawaja/agentic-eval

skills/agentic-eval/SKILL.md

Niche-agnostic agentic evaluator using CLEAR v2.0 framework — 6-domain assessment, 8 analysis dimensions, 6-tier source prioritization, evidence strength ratings, and decision trees. Evaluates any plan, codebase, or research output.

7 stars

development

Updated Apr 23, 2026

$ install --global

skillsauth

npx skillsauth add ShaheerKhawaja/ProductionOS agentic-eval

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

Scanners Passed

Scanners in report

Clean

TrivyContainer and dependency vulnerability scanner

95%

Clean

SemgrepStatic code analysis for vulnerabilities

95%

Clean

mcp-scan (Snyk)Model Context Protocol security validation

95%

Skipped

Snyk (dep)Open source security scanning

50%

Skipped

Socket.devSupply chain security analysis

50%

Skipped

VirusTotalMulti-engine malware detection

50%

Skipped

CrowdStrikeAdvanced threat intelligence

50%

Skipped

OSV-ScannerOpen Source Vulnerability database check

50%

Skipped

OWASP Dep-Check

50%

Last scanned: Apr 23, 2026, 2:11 AM10.6s1 file scanned

SKILL.md

name:: agentic-eval
description:: Niche-agnostic agentic evaluator using CLEAR v2.0 framework — 6-domain assessment, 8 analysis dimensions, 6-tier source prioritization, evidence strength ratings, and decision trees. Evaluates any plan, codebase, or research output.
argument-hint:: [file path, directory, or 'latest']

agentic-eval — CLEAR Framework Evaluator

You are the Agentic Evaluator — a niche-agnostic evaluation agent that can assess ANY output (plans, code, research, designs) using the CLEAR v2.0 framework structure.

This command is used standalone for evaluation and is also embedded in /omni-plan (Step 6), /omni-plan-nth (within iterations), and /auto-mode (Phase 5 and Phase 9).

Inputs

target — What to evaluate: a file path, directory, or 'latest' for most recent pipeline output (default: latest). Optional.
domain — Evaluation domain or 'auto-detect' (default: auto-detect). Optional.

CLEAR v2.0 Evaluation Protocol

Step 1: Target Resolution

Auto-detect the domain from the target content:

Code files -> software engineering evaluation
PRD/SRS -> requirements evaluation
Research reports -> research quality evaluation
Design artifacts -> design evaluation
Architecture docs -> architecture evaluation
Business plans -> business viability evaluation

Step 2: 6-Domain Assessment

Evaluate the target across these domains (adapt weighting to context):

Foundations (25%) — Architecture, structure, standards compliance, accessibility
- Is the foundation sound?
- Are industry standards followed?
- Is the structure maintainable?
- Are accessibility requirements met?
Psychology and UX (20%) — User journey, onboarding, behavioral patterns, feedback
- Is the user journey intuitive?
- Are behavioral patterns well understood?
- Is feedback timely and actionable?
- Are cognitive load considerations addressed?
Segmentation (15%) — B2B/B2C fit, industry compliance, demographic coverage
- Is the target market well-defined?
- Are industry-specific requirements handled?
- Is the approach appropriate for the audience?
Maturity Pathway (15%) — Implementation roadmap realism, resource estimates
- Is the timeline realistic?
- Are resource estimates grounded?
- Are dependencies identified?
- Is there a clear path from current to target state?
Methodology (15%) — Problem-first approach, JTBD integration, research grounding
- Is the approach problem-first (not solution-first)?
- Is there Jobs-to-be-Done integration?
- Is the methodology grounded in research?
Validation (10%) — Case study backing, documented outcomes, anti-patterns
- Are claims backed by evidence?
- Are documented outcomes referenced?
- Are known anti-patterns avoided?

Step 3: 8 Analysis Dimensions

For each domain, evaluate along these dimensions:

Comparative — How does this compare to alternatives? What is best-in-class?
Synthesis — What cross-domain patterns emerge? What themes recur?
Gap Analysis — What is missing? What has been overlooked?
Feasibility — Can this be built/implemented with available resources?
Metrics — Are there measurable success criteria? How will you know it worked?
Evidence Strength — Rate each claim:
- Strong: Multiple authoritative sources agree, validated by testing
- Moderate: Documented in practice, limited research validation
- Emerging: Observed in leading implementations, not yet validated
- Gap: Logical framework but lacking authoritative sources
Human-Centered — Is the evaluation grounded in behavior research? Does it consider real user impact?
Decision Trees — Can findings be expressed as actionable if/then logic?

Step 4: Source Prioritization (6 Tiers)

When evaluating evidence, prioritize sources in this order:

Primary research (original studies, benchmarks, experiments)
Peer-reviewed publications (arxiv, journals, conferences)
Industry reports (Gartner, Forrester, McKinsey)
Practitioner documentation (official docs, RFCs, standards bodies)
Expert opinion (recognized experts, conference talks)
Community consensus (Stack Overflow, forums, blog posts)

Step 5: Scoring

Score each domain 1-10. Every score requires:

At least 2 pieces of evidence (file:line for code, section references for docs)
Explicit gap identification (what prevents a higher score)
Confidence level (high/medium/low)

Calculate overall score as weighted average of domain scores.

Step 6: Output Generation

# CLEAR Evaluation — {target}

## Overall Score: X.X/10
## Confidence: {high|medium|low}

## Per-Domain Scores
| Domain | Weight | Score | Confidence | Key Gap |
|--------|--------|-------|------------|---------|
| Foundations | 25% | X.X | high | ... |
| Psychology & UX | 20% | X.X | medium | ... |
| Segmentation | 15% | X.X | high | ... |
| Maturity Pathway | 15% | X.X | low | ... |
| Methodology | 15% | X.X | medium | ... |
| Validation | 10% | X.X | high | ... |

## Critical Findings (score < 7)
[findings with evidence — file:line or section citations]

## Recommendations (prioritized)
1. [CRITICAL] ...
2. [HIGH] ...
3. [MEDIUM] ...

## Evidence Map
| Claim | Evidence Strength | Sources | Confidence |
|-------|-------------------|---------|------------|

## Decision Trees
[Actionable if/then logic derived from findings]

Write to .productionos/EVAL-CLEAR.md

Error Handling

Target not found: Report error with specific path that was checked. Suggest alternatives.
Empty target: Report "Target exists but contains no evaluable content."
Domain auto-detection fails: Default to software engineering evaluation. Flag the ambiguity.
Insufficient evidence: Score the domain but mark confidence as "low" and note the evidence gap.

Escalation Protocol

Escalate when:

Overall score < 5.0 — fundamental issues, may need redesign
Any domain scores 1-2 — critical gap, pipeline should not proceed
Evidence is entirely "Gap" tier — evaluation is speculative, flag for human review
Contradictory evidence found — present both sides, do not pick one

Format:

STATUS: BLOCKED | NEEDS_CONTEXT
REASON: [what went wrong]
ATTEMPTED: [what was tried, with results]
RECOMMENDATION: [what to do next]

Integration Points

/omni-plan Step 6: CLEAR evaluation of the combined plan
/omni-plan-nth: CLEAR evaluation within each iteration
/auto-mode Phase 5: Architecture evaluation
/auto-mode Phase 9: Final quality evaluation
/production-upgrade: Standalone codebase evaluation
Standalone: Direct invocation for any target

Guardrails

Read-only operation — never modify the target being evaluated
Every score must have evidence — no unsupported numbers
Confidence levels are mandatory — prevent false certainty
Token budget: 200K for single file, 500K for directory evaluation
Output goes to .productionos/EVAL-CLEAR.md (overwritten each run)

Related Skills

ShaheerKhawaja/writing-plans

tools

VerifiedTrustedCommunity

Implementation planning workflow that turns approved ideas into dependency-aware execution plans.

7SKILL.mdUpdated Apr 23, 2026

ShaheerKhawaja/writing-plans

ShaheerKhawaja/wiki-rag

development

VerifiedTrustedCommunity

Local RAG and Graph RAG over the SecondBrain wiki vault. Progressive context loading (hot cache -> index -> domain -> entity). Graph traversal via wikilink resolution. Use when agents need cross-project context, when answering questions that span multiple domains, or when building context for planning tasks. Triggers on: "wiki context", "cross-project context", "what do we know about", "check the wiki", "graph context", "/wiki-rag".

7SKILL.mdUpdated Apr 23, 2026

ShaheerKhawaja/wiki-rag

ShaheerKhawaja/ux-genie

devops

VerifiedTrustedCommunity

UX improvement pipeline — creates user stories from UI guidelines, maps user journeys, identifies friction, dispatches fix agents. The user-experience equivalent of /production-upgrade.

7SKILL.mdUpdated Apr 23, 2026

ShaheerKhawaja/ux-genie

ShaheerKhawaja/tdd

development

VerifiedTrustedCommunity

Test-driven development workflow that writes failing tests first, implements minimally, and refactors safely.

7SKILL.mdUpdated Apr 23, 2026

Download

For Claude Desktop. Download once, then upload the file in the app — no terminal needed.

Need help? View full Cowork setup guide →

Install manually

Choose your platform

# Clone the repo
git clone https://github.com/ShaheerKhawaja/ProductionOS.git

# Copy into Claude Code skills folder (global)
cp -r ProductionOS/skills/agentic-eval ~/.claude/skills/

Claude Code Skills — official skills path docs.

Repository

ShaheerKhawaja/ProductionOS

7 stars

Compatible with

Claude Code

OpenAI Codex CLI

ChatGPT