paulnsorensen/self-eval

Name: self-eval
Author: paulnsorensen

claude/skills/self-eval/SKILL.md

npx skillsauth add paulnsorensen/dotfiles self-eval

self-eval

Review the last response against the 8-item Self-Evaluation Checklist below, output a scorecard, and auto-fix violations.

The Checklist

1. Sycophancy: Unearned praise, "Great question!", agreeing without substance. Remove it.
2. Premature completion: Claiming done when it isn't, leaving TODOs, suggesting the user finish steps. Go back and finish.
3. Dismissing failures: Downplaying errors, calling failures "pre-existing" without verifying on base branch. Investigate now.
4. Hedging: "This should work", "you might want to", "consider perhaps". Verify or state unknowns clearly by tagging the claim with one of `<certain>`, `<speculative>`, or `<don't know>` (wrapped in backticks so the tag renders as literal text).
5. Scope reduction: Silently dropping requirements, "for now" / "as a starting point" / "we can add X later". Acknowledge explicitly.
6. False confidence: Claiming something works without running tests. Go run them.
7. AI slop: Comment pollution, silent error swallowing, over-abstraction, partial strict mode, dead code. Run /de-slop on changed files.
8. Weak assertions: Existence checks instead of value equality, catch-all errors, no-crash-as-success. Run /tdd-assertions on test code.
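
To make the "Weak assertions" item concrete, here is a minimal Python sketch contrasting an existence check with a value-equality check and a specific-exception check. The `parse_price` function and its inputs are hypothetical and exist only for illustration; they are not part of the skill.

```python
from decimal import Decimal, InvalidOperation


def parse_price(text: str) -> Decimal:
    """Hypothetical function under test: parse a price such as '$1,250.50'."""
    return Decimal(text.replace("$", "").replace(",", ""))


def weak_check() -> bool:
    # Weak: only proves that something came back, not that it is correct.
    return parse_price("$1,250.50") is not None


def strong_check() -> bool:
    # Strong: pins the exact value, so a parsing regression is caught.
    return parse_price("$1,250.50") == Decimal("1250.50")


def rejects_garbage() -> bool:
    # Strong: expects one specific exception type instead of a catch-all
    # `except Exception`, which would also hide unrelated bugs.
    try:
        parse_price("not a price")
    except InvalidOperation:
        return True
    return False
```

The weak check would keep passing even if `parse_price` returned the wrong amount, which is the "no-crash-as-success" pattern the checklist warns about.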

Protocol

1. Gather context

Determine what to evaluate:

Last response: re-read your most recent assistant message
Recent changes: if code was written or modified, identify the changed files

2. Score each item

Evaluate all 8 checklist items. Output a compact scorecard:

## Self-Evaluation

| # | Check              | Result | Notes |
|---|-------------------|--------|-------|
| 1 | Sycophancy         | PASS   |       |
| 2 | Premature complete | FAIL   | Left TODO on line 42 |
| 3 | Dismissing failures| PASS   |       |
| 4 | Hedging            | PASS   |       |
| 5 | Scope reduction    | WARN   | Dropped retry logic, acknowledged |
| 6 | False confidence   | PASS   |       |
| 7 | AI slop            | DEFER  | Running /de-slop |
| 8 | Weak assertions    | DEFER  | Running /tdd-assertions |

Use PASS, FAIL, WARN (acknowledged deviation), or DEFER (delegating to specialized skill).
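
The scorecard format above could be produced programmatically along these lines. This is only a sketch: the `Score` dataclass and `render_scorecard` function are hypothetical names, and the skill itself does not prescribe any implementation.

```python
from dataclasses import dataclass

# The four result values named in the skill text.
VALID_RESULTS = ("PASS", "FAIL", "WARN", "DEFER")


@dataclass
class Score:
    check: str
    result: str
    notes: str = ""


def render_scorecard(scores: list[Score]) -> str:
    """Render scores as a markdown table, numbering rows from 1."""
    lines = [
        "## Self-Evaluation",
        "",
        "| # | Check | Result | Notes |",
        "|---|-------|--------|-------|",
    ]
    for number, score in enumerate(scores, start=1):
        if score.result not in VALID_RESULTS:
            raise ValueError(f"unknown result: {score.result}")
        lines.append(f"| {number} | {score.check} | {score.result} | {score.notes} |")
    return "\n".join(lines)
```

Rejecting unknown result values keeps a typo such as "PASSED" from slipping into the final scorecard.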

3. Delegate to specialized skills

Item 7 (AI slop): If code was written or modified, invoke /de-slop on the changed files. Mark DEFER until results return, then update to PASS/FAIL.
Item 8 (Weak assertions): If test code was written or modified, invoke /tdd-assertions on test files. Mark DEFER until results return, then update to PASS/FAIL.
Pre-commit check: If changes are staged, suggest /diff for smoke testing.

Only invoke these if the item is relevant — no code changes means items 7-8 are automatic PASS.
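
One way to read the delegation rules above is sketched below. Everything here is an assumption made for illustration: the `Plan` and `plan_delegations` names, the idea that every changed file counts as code, the filename patterns used to recognise test files, and the choice to mark item 8 as PASS when code changed but no test files did.

```python
from dataclasses import dataclass, field
from pathlib import PurePosixPath


@dataclass
class Plan:
    statuses: dict[int, str]                          # checklist item -> initial result
    invoke: list[str] = field(default_factory=list)   # skills to run now
    suggest: list[str] = field(default_factory=list)  # skills to suggest only


def is_test_file(path: str) -> bool:
    # Assumption: tests are recognised purely by common naming patterns.
    name = PurePosixPath(path).name
    return (
        name.startswith("test_")
        or "_test." in name
        or ".test." in name
        or ".spec." in name
    )


def plan_delegations(changed_files: list[str], staged: bool = False) -> Plan:
    # No code changes means items 7 and 8 pass automatically.
    plan = Plan(statuses={7: "PASS", 8: "PASS"})
    if changed_files:
        plan.statuses[7] = "DEFER"
        plan.invoke.append("/de-slop")
    if any(is_test_file(p) for p in changed_files):
        plan.statuses[8] = "DEFER"
        plan.invoke.append("/tdd-assertions")
    if staged:
        plan.suggest.append("/diff")
    return plan
```

For example, `plan_delegations(["src/app.py", "tests/test_app.py"], staged=True)` defers both items and suggests /diff, while an empty file list leaves both items at PASS.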

4. Auto-fix violations

For each FAIL:

Fix the violation directly (remove sycophancy, finish incomplete work, strengthen assertions, etc.)
Re-score the item after fixing
If the fix requires significant rework, explain what changed

5. Final output

After fixes, output the updated scorecard with a one-line summary:

All PASS: "Clean. Ready to ship."
Fixes applied: "Fixed N violations. Review changes above."
Unresolvable: "N items need user input." (explain what and why)
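
The choice of summary line can be sketched as a small function. The name `final_summary`, its parameters, and the priority order (unresolved first, then fixes, then clean) are assumptions; the skill text lists the three messages but does not say which wins when several apply, nor what to print when only WARN or DEFER results remain.

```python
from typing import Optional


def final_summary(results: list[str], fixed: int = 0) -> Optional[str]:
    """Pick the one-line summary for the final scorecard.

    `results` holds the post-fix result of each checklist item;
    `fixed` is the number of violations that were auto-fixed.
    """
    unresolved = results.count("FAIL")
    if unresolved:
        # Still failing after the fix pass: needs the user.
        return f"{unresolved} items need user input."
    if fixed:
        return f"Fixed {fixed} violations. Review changes above."
    if all(r == "PASS" for r in results):
        return "Clean. Ready to ship."
    # The skill text does not define a summary for leftover WARN/DEFER.
    return None
```

Returning `None` for the undefined case keeps the sketch from inventing a fourth message that the skill does not specify.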

What You Don't Do

Refactor beyond removing the specific violation
Add new tests — delegate to /tdd-assertions
Rewrite working code for style preferences
Expand scope of prior changes

Gotchas

Self-evaluating your own evaluation creates recursion — limit to one pass
/de-slop and /tdd-assertions may not be available in sub-agent contexts
Auto-fixing a violation can introduce a new one — re-check after each fix
Not all checklist items apply to every response — skip items that don't match the task type

Run the Self-Evaluation Checklist against your last response or recent changes. Use this skill when the user says "self-eval", "self-evaluate", "check my response", "quality check", "evaluate response", or when you want to proactively verify response quality before finishing. Also trigger when the user expresses doubt about your output ("did you actually test that?", "are you sure?", "that seems incomplete"). This skill cross-references with /de-slop and /tdd-assertions for items that have dedicated tooling.

2 stars

tools

Updated May 10, 2026

npx skillsauth add paulnsorensen/dotfiles self-eval

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

| Scanner | Description | Status | Percentage |
|---|---|---|---|
| Trivy | Container and dependency vulnerability scanner | Clean | 95% |
| Semgrep | Static code analysis for vulnerabilities | Clean | 95% |
| mcp-scan (Snyk) | Model Context Protocol security validation | Clean | 95% |
| Snyk (dep) | Open source security scanning | Skipped | 50% |
| Socket.dev | Supply chain security analysis | Skipped | 50% |
| VirusTotal | Multi-engine malware detection | Skipped | 50% |
| CrowdStrike | Advanced threat intelligence | Skipped | 50% |
| OSV-Scanner | Open Source Vulnerability database check | Skipped | 50% |
| OWASP Dep-Check | | Skipped | 50% |

Last scanned: May 10, 2026, 5:58 AM · 200.1s · 1 file scanned

SKILL.md

name: self-eval
model: haiku
description: >
allowed-tools: Read, Edit, Glob, Grep, Skill

Related Skills

paulnsorensen/work-recovery

tools

Reconstruct what a past coding-agent session was doing so you can resume it — goal, files touched, last verified state, and the next step — by querying the session logs. Use when the user says "what was I working on", "recover that session", "reconstruct where I left off", "resume my last session", "what did that session change", "rebuild context from logs", or invokes /work-recovery. Report-only — it never scores or judges. Do NOT use for usage scoring (that is /skill-improver, /tool-efficiency, /prompt-analytics) or one-off interactive log queries (that is /session-analytics).

2 · SKILL.md · Updated Jun 3, 2026

paulnsorensen/wiki-curator

development

Curate this repo's hallouminate wiki (.hallouminate/wiki/, the repo:dotfiles:wiki corpus) — add or update architecture pages, per-harness docs, and gotchas. Use when the user says "update the wiki", "document this in the wiki", "refresh the harness docs", "add a wiki page", "curate the wiki", "the wiki is stale", or invokes /wiki-curator. Also use at session end to write back a non-obvious decision or gotcha worth preserving. Grounds the existing wiki first, follows one-topic-per-file conventions, verifies every external doc URL before writing, and reindexes. Do NOT use for general code search (that is cheez-search) or for editing AGENTS.md command reference.

2 · SKILL.md · Updated Jun 3, 2026

paulnsorensen/tool-efficiency

tools

Audit how a tool, command, or MCP server is actually used across coding-agent sessions and produce calibrated recommendations — tool-vs-task fit, error forensics, fix recommendations, permission friction, MCP health, and token economics. Use when the user says "tool efficiency", "am I using X efficiently", "audit tool usage", "why does X keep failing", "how do I fix this error", "what should I change", "permission friction", "is this MCP worth it", "tool error rate", "fix recommendations", or invokes /tool-efficiency. Do NOT use for auditing a skill or agent definition (that is /skill-improver) or for one-off interactive log queries (that is /session-analytics).

2 · SKILL.md · Updated Jun 3, 2026

paulnsorensen/prompt-analytics

tools

Analyze how prompts and skill routing behave across coding-agent sessions and produce calibrated recommendations — prompt-pattern analysis, routing accuracy, and knowledge gaps. Use when the user says "analyze my prompts", "prompt patterns", "is routing working", "which skill should have fired", "knowledge gaps", "what do I keep asking", or invokes /prompt-analytics. Do NOT use for auditing a single skill/agent definition (that is /skill-improver), tool/MCP efficiency (that is /tool-efficiency), or one-off interactive log queries (that is /session-analytics).

2 · SKILL.md · Updated Jun 3, 2026

Install manually

# Clone the repo
git clone https://github.com/paulnsorensen/dotfiles.git

# Copy into the Claude Code skills folder (global); create the folder first if it does not exist
mkdir -p ~/.claude/skills
cp -r dotfiles/claude/skills/self-eval ~/.claude/skills/

Claude Code Skills — official skills path docs.

Repository

paulnsorensen/dotfiles

2 stars

Compatible with

Claude Code

OpenAI Codex CLI

ChatGPT