skills/persona-evaluator/SKILL.md
Audit any OpenPersona (or peer LLM-agent) persona in three complementary modes: structural (CLI, deterministic, CI-friendly: 4 Layers × 5 Systemic Concepts × Constitution gate with role-aware severity), semantic white-box (LLM reads pack-content JSON and scores Soul-narrative quality via rubrics), and semantic black-box (LLM evaluates a remote agent it cannot read on disk, via A2A handshake / consent-probe / passive observation, with confidence caps). Produces quality reports with dimension scores, strengths, and actionable improvements. Use when asked to evaluate, audit, score, review, self-review, peer-review, or black-box review an agent.
npx skillsauth add acnlabs/openpersona persona-evaluatorInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Score any OpenPersona persona pack against the 4+5 framework standard: 4 Layers (Soul · Body · Faculty · Skill) × 5 Systemic Concepts (Evolution · Economy · Vitality · Social · Rhythm) + Constitution compliance gate.
persona-evaluator reads persona.json, generated artifacts, and soul files to produce a structured 9-dimension report — calibrated to the OpenPersona quality standard, with role-aware severity and three modes for self / peer / black-box review.
# Evaluate an installed persona (static / structural)
npx openpersona evaluate <slug>
# JSON output (for scripting or CI)
npx openpersona evaluate <slug> --json
# Save report to file (always JSON; --json not needed alongside --output)
npx openpersona evaluate <slug> --output report.json
# Embed evaluable persona content (Soul/character/behavior-guide) so an
# LLM evaluator (this skill, acting through an agent) can also judge
# quality semantically — not just structurally
npx openpersona evaluate <slug> --pack-content
persona-evaluator runs in three complementary modes. Pick the mode based on what the user asks before reading the rest of this file.
| User asks | Mode | How | Confidence |
| ---------------------------------------------------- | ----------------------------- | -------------------------------------------------------------------- | --------------------------- |
| "CI / gate persona quality" | structural | npx openpersona evaluate <slug> | deterministic |
| "Polish review of my own pack" | semantic white-box (self) | ... evaluate <slug> --pack-content, then apply rubric in self-mode | high |
| "Peer-review a pack I have on disk" | semantic white-box (peer) | same command, peer-mode rubric | high |
| "Review agent X" where X is remote / non-OpenPersona | semantic black-box | A2A handshake → consent + probe → passive, in that order | mid (cap 8/10) or low (cap 6/10) |
Structural is the default. Switch to semantic only when the user explicitly asks for narrative quality review (e.g. "evaluate me semantically", "self-review my pack", "qualitative audit"). Switch to black-box only when you cannot read the subject's persona.json on disk.
Sections below cover each mode in depth: structural (What Gets Scored), semantic white-box (Semantic Evaluation), and semantic black-box (Black-box Semantic Evaluation).
The structural CLI scores 9 dimensions + the Constitution gate. Severity (strict / normal / lenient) is set per dimension by the persona's declared role.
| Layer / Concept | Dimension | Looks at |
| ---------------- | ------------------------------ | ------------------------------------------------------ |
| Soul | identity, character, aesthetic | persona.json Soul block + soul/*.md |
| Body | environment, runtime | hardware/runtime declaration |
| Faculty | tools, capabilities | declared tools and capability budget |
| Skill | external skill packs | declared skill links and trust levels |
| Evolution | learning loops | evolution.instance and immutable traits |
| Economy | cost / token budgets | declared budgets, fail-closed posture |
| Vitality | health checks | runtime sanity / lifecycle/vitality outputs |
| Social | A2A behavior | agent-card capabilities, peer-eval declarations |
| Rhythm | cadence / activation | invocation cadence and activation conditions |
| Constitution | §1–§5 compliance gate | a hard cap of 3 if any §3 Safety violation is detected |
Each dimension produces a 0–10 score, a list of issues (✗), and suggestions (→). The overall score is a severity-weighted average — see Role-aware scoring.
The structural evaluator already reads soul/identity.role and assigns each dimension a severity. The semantic reviewer must respect those severities (see references/RUBRICS.md for the rubric anchors).
| Role | Strict (must-be-strong) | Lenient (won't be penalised) | Notes |
| ------------- | ----------------------------------- | ------------------------------- | ------------------------------------------------------------------------ |
| assistant | identity, character, faculty | aesthetic | Default. |
| companion | character, aesthetic, evolution | faculty, skill | Soul-heavy; tooling thinness is OK. |
| tool | faculty, skill, vitality | character, aesthetic, evolution | Behavior matters; backstory does not. |
| expert | faculty, skill, identity | aesthetic | Domain authority; soft Soul OK if identity.bio carries the credential. |
| guide | character, social, evolution | faculty | Conversation steward. |
| entertainer | character, aesthetic, speakingStyle | faculty, skill | Voice and vibe are the product. |
If soul/identity.role is missing or unrecognised, the evaluator falls back to assistant.
Each dimension shows:
✓ identity 9/10 (strict)
✗ character.boundaries 4/10 (strict)
✗ no hard limits declared in `boundaries`
→ add at least one enforceable rule (cite §3 Safety)
The summary footer prints overall score, Constitution status, and a sorted list of dimensions by severity.
| Band | Score | Meaning | | --------- | ----- | ----------------------------------------------------- | | Excellent | 9–10 | Production-ready, distinctive. | | Good | 7–8 | Ship-able with minor polish. | | Adequate | 5–6 | Functional, identifiable gaps. | | Poor | 3–4 | Needs structural fixes before use. | | Broken | 0–2 | Missing required content or violates Constitution §3. |
A Constitution §3 violation caps the overall score at 3 regardless of other dimensions.
Structural mode is deterministic. Semantic mode is the LLM agent's qualitative review — narrative quality of background, personality, speakingStyle, voice fidelity in behavior-guide.md, etc. Two scenarios share the same procedure:
Trigger semantic mode only when the user explicitly asks for it — phrases like "evaluate me semantically", "self-review my pack", "peer-review this persona", "qualitative audit". Otherwise, default to structural mode.
--pack-content: npx openpersona evaluate <slug> --pack-content
constitution.passed === false. Do not produce semantic scores when §3 has failed; the structural blockers must be fixed first.report.packContent from the JSON. It includes (where defined): character.{background,personality,speakingStyle,boundaries}, immutableTraits, aesthetic.{emoji,creature,vibe}, and a whitelisted soulDocs map keyed by filename (behavior-guide.md, self-narrative.md, identity.md — only those that exist).severity already attached to each dimension by the structural evaluator (strict / normal / lenient) to gate which checks count.## White-box format). Keep it under ~500 words.You are evaluating your own pack. The user has invited you (the host persona) to review yourself.
You are evaluating a different persona. The user has invited you (Reviewer-X) to look at Subject-Y.
strict-Skill assistant, you will instinctively want Subject-Y to also be Skill-rigorous, even if Subject-Y is a companion.role and weights block before each rubric. Lower expectations for lenient dimensions even if you personally find them important.companion but reads like assistant), say so as a separate cross-cutting observation — don't silently re-score against your preferred role.Everything above assumes you can read the subject's persona.json and soul/*.md. That's false in the most common peer-audit scenario: you're asked to evaluate another agent whose pack you cannot read. In that case the rubrics are the same; what changes is the data source and the confidence cap.
Three data-source tiers, in descending fidelity:
| Tier | Data source | Consent | Confidence | Cap (per-field & overall) |
| ---- | --------------------------------------------------------------------------- | --------------------------------- | ---------- | ------------------------------------ |
| 1 | A2A pack-content handshake — subject voluntarily ships its evaluable JSON | Reply itself is the consent token | high | none — produces a white-box report |
| 2 | Explicit consent + structured probe set (10 core + optional deep-dives) | Yes, before any probe | mid | 8/10 |
| 3 | Passive observation of voluntarily-public material | No (must label the report) | low | 6/10 |
Tier 1 produces the regular white-box report (header line: Data source: A2A pack-content handshake from <subject-slug>). Tier 2 and Tier 3 produce a separate black-box report. Never escalate tiers silently.
Full mechanics — handshake schema, probe table, identity-coherence dimension, confidence-cap justification, and hard rules — live in references/BLACK-BOX.md.
The black-box report format is in references/REPORT-FORMAT.md (## Black-box format).
Constitution violations are hard blocks — they cap the score at 3 regardless of everything else. Open soul/behavior-guide.md and remove any capability declarations that violate §3 Safety.
Issues (✗) indicate missing required elements or broken configurations. Suggestions (→) are optional enhancements. Prioritize issues in low-scoring dimensions.
For Soul-layer fixes (background depth, speaking style, boundaries):
npx openpersona refine <slug> --emit # request refinement via Signal Protocol
# (host LLM generates improvements)
npx openpersona refine <slug> --apply # apply approved refinement
For structural fixes (missing faculty, missing minTrustLevel):
Edit persona.json directly and regenerate:
npx openpersona update <slug> # regenerate from updated persona.json
After applying any fix, re-run npx openpersona evaluate <slug> (see Quick Start) to verify the score improved and Constitution gate passes.
# .github/workflows/persona-quality.yml
- name: Evaluate persona quality
run: |
npx openpersona evaluate ${{ env.PERSONA_SLUG }} --output report.json
SCORE=$(jq '.overallScore' report.json)
if [ "$SCORE" -lt 6 ]; then
echo "Persona quality score $SCORE < 6 — review required"
exit 1
fi
| Skill | Relationship |
| --------------------- | --------------------------------------------------------------------------- |
| open-persona | Creates personas that persona-evaluator audits — the production/QA pair |
| anyone-skill | Distills personas that can be evaluated with this skill after generation |
| open-persona refine | The fix path after persona-evaluator identifies Soul-layer improvements |
persona-evaluator ships bundled with the OpenPersona framework and is available immediately after installing it:
npm install -g openpersona
# persona-evaluator is included — no separate install needed
npx openpersona evaluate <slug>
A standalone distributable is also available at acnlabs/persona-evaluator on GitHub and listed on openpersona.co/skill/persona-evaluator.
Current version: 0.3.4 (also in frontmatter metadata.version).
See CHANGELOG.md for full version history, rationale, test surface, and re-validation evidence. The deeper rubric review trail lives in docs/SKILL-RUBRIC.md and docs/SKILL-RUBRIC-SESSION-2.md in the main OpenPersona repo.
tools
Distill any commercial entity into a personalized brand agent — a living brand persona with authentic voice, declared service capabilities, and a standard service contract. Every commercial entity has a brand: a name, a style, a way of showing up in the world. This skill exists so that a street vendor, a family clinic, and a global chain can all have their own agent on equal footing. Supports both distillation from existing brand content and declaration from scratch.
development
A local-first personal AI double framework that helps users build, govern, and evolve their own digital self with clear
development
A complete pipeline to build your AI Second Me: distill your identity from personal data, grow a private knowledge base, train a local model, and govern what gets shared.
development
Your AI Founder Partner for building and scaling startups — diagnose your stage, run hypothesis experiments, make pricing decisions, design growth loops, and ship weekly execution reviews.