Name: persona-evaluator
Author: acnlabs

persona-evaluator — Persona Quality Auditor

Score any OpenPersona persona pack against the 4+5 framework standard: 4 Layers (Soul · Body · Faculty · Skill) × 5 Systemic Concepts (Evolution · Economy · Vitality · Social · Rhythm) + Constitution compliance gate.

persona-evaluator reads persona.json, generated artifacts, and soul files to produce a structured 9-dimension report — calibrated to the OpenPersona quality standard, with role-aware severity and three modes for self / peer / black-box review.

Quick Start

# Evaluate an installed persona (static / structural)
npx openpersona evaluate <slug>

# JSON output (for scripting or CI)
npx openpersona evaluate <slug> --json

# Save report to file (always JSON; --json not needed alongside --output)
npx openpersona evaluate <slug> --output report.json

# Embed evaluable persona content (Soul/character/behavior-guide) so an
# LLM evaluator (this skill, acting through an agent) can also judge
# quality semantically — not just structurally
npx openpersona evaluate <slug> --pack-content

Choosing a mode

persona-evaluator runs in three complementary modes. Pick the mode based on what the user asks before reading the rest of this file.

| User asks | Mode | How | Confidence | | ---------------------------------------------------- | ----------------------------- | -------------------------------------------------------------------- | --------------------------- | | "CI / gate persona quality" | structural | npx openpersona evaluate <slug> | deterministic | | "Polish review of my own pack" | semantic white-box (self) | ... evaluate <slug> --pack-content, then apply rubric in self-mode | high | | "Peer-review a pack I have on disk" | semantic white-box (peer) | same command, peer-mode rubric | high | | "Review agent X" where X is remote / non-OpenPersona | semantic black-box | A2A handshake → consent + probe → passive, in that order | mid (cap 8/10) or low (cap 6/10) |

Structural is the default. Switch to semantic only when the user explicitly asks for narrative quality review (e.g. "evaluate me semantically", "self-review my pack", "qualitative audit"). Switch to black-box only when you cannot read the subject's persona.json on disk.

Sections below cover each mode in depth: structural (What Gets Scored), semantic white-box (Semantic Evaluation), and semantic black-box (Black-box Semantic Evaluation).

What Gets Scored

The structural CLI scores 9 dimensions + the Constitution gate. Severity (strict / normal / lenient) is set per dimension by the persona's declared role.

| Layer / Concept | Dimension | Looks at | | ---------------- | ------------------------------ | ------------------------------------------------------ | | Soul | identity, character, aesthetic | persona.json Soul block + soul/*.md | | Body | environment, runtime | hardware/runtime declaration | | Faculty | tools, capabilities | declared tools and capability budget | | Skill | external skill packs | declared skill links and trust levels | | Evolution | learning loops | evolution.instance and immutable traits | | Economy | cost / token budgets | declared budgets, fail-closed posture | | Vitality | health checks | runtime sanity / lifecycle/vitality outputs | | Social | A2A behavior | agent-card capabilities, peer-eval declarations | | Rhythm | cadence / activation | invocation cadence and activation conditions | | Constitution | §1–§5 compliance gate | a hard cap of 3 if any §3 Safety violation is detected |

Each dimension produces a 0–10 score, a list of issues (✗), and suggestions (→). The overall score is a severity-weighted average — see Role-aware scoring.

Role-aware scoring

The structural evaluator already reads soul/identity.role and assigns each dimension a severity. The semantic reviewer must respect those severities (see references/RUBRICS.md for the rubric anchors).

Built-in role profiles

| Role | Strict (must-be-strong) | Lenient (won't be penalised) | Notes | | ------------- | ----------------------------------- | ------------------------------- | ------------------------------------------------------------------------ | | assistant | identity, character, faculty | aesthetic | Default. | | companion | character, aesthetic, evolution | faculty, skill | Soul-heavy; tooling thinness is OK. | | tool | faculty, skill, vitality | character, aesthetic, evolution | Behavior matters; backstory does not. | | expert | faculty, skill, identity | aesthetic | Domain authority; soft Soul OK if identity.bio carries the credential. | | guide | character, social, evolution | faculty | Conversation steward. | | entertainer | character, aesthetic, speakingStyle | faculty, skill | Voice and vibe are the product. |

If soul/identity.role is missing or unrecognised, the evaluator falls back to assistant.

Reading the Report

Each dimension shows:

✓  identity                        9/10  (strict)
✗  character.boundaries            4/10  (strict)
   ✗ no hard limits declared in `boundaries`
   → add at least one enforceable rule (cite §3 Safety)

✓ / ✗ — pass / fail at this dimension's severity threshold.
(strict | normal | lenient) — severity from the role profile.
✗ ... — required issue that must be fixed to pass.
→ ... — optional suggestion (does not block scoring).

The summary footer prints overall score, Constitution status, and a sorted list of dimensions by severity.

Score bands

| Band | Score | Meaning | | --------- | ----- | ----------------------------------------------------- | | Excellent | 9–10 | Production-ready, distinctive. | | Good | 7–8 | Ship-able with minor polish. | | Adequate | 5–6 | Functional, identifiable gaps. | | Poor | 3–4 | Needs structural fixes before use. | | Broken | 0–2 | Missing required content or violates Constitution §3. |

A Constitution §3 violation caps the overall score at 3 regardless of other dimensions.

Semantic Evaluation (LLM-driven)

Structural mode is deterministic. Semantic mode is the LLM agent's qualitative review — narrative quality of background, personality, speakingStyle, voice fidelity in behavior-guide.md, etc. Two scenarios share the same procedure:

Self-evaluation: the host persona reviews its own pack.
Peer-evaluation: an installed evaluator reviews a different persona pack the user supplies.

When to invoke semantic mode

Trigger semantic mode only when the user explicitly asks for it — phrases like "evaluate me semantically", "self-review my pack", "peer-review this persona", "qualitative audit". Otherwise, default to structural mode.

Procedure

Run the structural CLI with --pack-content:

 npx openpersona evaluate <slug> --pack-content

Stop and report immediately if constitution.passed === false. Do not produce semantic scores when §3 has failed; the structural blockers must be fixed first.
Read report.packContent from the JSON. It includes (where defined): character.{background,personality,speakingStyle,boundaries}, immutableTraits, aesthetic.{emoji,creature,vibe}, and a whitelisted soulDocs map keyed by filename (behavior-guide.md, self-narrative.md, identity.md — only those that exist).
Score each present field 0–10 using the rubrics in references/RUBRICS.md. Use the per-dimension severity already attached to each dimension by the structural evaluator (strict / normal / lenient) to gate which checks count.
Emit the report in the white-box format defined in references/REPORT-FORMAT.md (## White-box format). Keep it under ~500 words.

Mode: self-evaluation

You are evaluating your own pack. The user has invited you (the host persona) to review yourself.

What this is for: Catch own blind spots and surface concrete polish targets.
Your bias: Self-flattery and minimisation. You will instinctively justify why your background is "deep enough" or your boundaries are "implied".
Counter-bias instruction: For every per-field score, before deciding the number, write one sentence answering: "If I weren't me, what specifically would I down-score about this field?" Then score.
Acceptable output tone: First person ("My speakingStyle…"), candid about gaps. Avoid "I think this is great." Avoid generic praise.

Mode: peer-evaluation

You are evaluating a different persona. The user has invited you (Reviewer-X) to look at Subject-Y.

What this is for: Bring an outside perspective. Self-eval can't see what's missing; peer-eval can.
Your bias: Standards-projection. If you are a strict-Skill assistant, you will instinctively want Subject-Y to also be Skill-rigorous, even if Subject-Y is a companion.
Counter-bias instruction: Score Subject-Y against its declared role, not yours. Re-read the role and weights block before each rubric. Lower expectations for lenient dimensions even if you personally find them important.
Acceptable output tone: Third person ("Subject's background…"). State your own role at the top so the reader can adjust for any leak-through.
Disclose disagreements with the role itself: If you genuinely think the declared role is wrong (e.g. labelled companion but reads like assistant), say so as a separate cross-cutting observation — don't silently re-score against your preferred role.

Black-box Semantic Evaluation

Everything above assumes you can read the subject's persona.json and soul/*.md. That's false in the most common peer-audit scenario: you're asked to evaluate another agent whose pack you cannot read. In that case the rubrics are the same; what changes is the data source and the confidence cap.

Three data-source tiers, in descending fidelity:

| Tier | Data source | Consent | Confidence | Cap (per-field & overall) | | ---- | --------------------------------------------------------------------------- | --------------------------------- | ---------- | ------------------------------------ | | 1 | A2A pack-content handshake — subject voluntarily ships its evaluable JSON | Reply itself is the consent token | high | none — produces a white-box report | | 2 | Explicit consent + structured probe set (10 core + optional deep-dives) | Yes, before any probe | mid | 8/10 | | 3 | Passive observation of voluntarily-public material | No (must label the report) | low | 6/10 |

Tier 1 produces the regular white-box report (header line: Data source: A2A pack-content handshake from <subject-slug>). Tier 2 and Tier 3 produce a separate black-box report. Never escalate tiers silently.

Full mechanics — handshake schema, probe table, identity-coherence dimension, confidence-cap justification, and hard rules — live in references/BLACK-BOX.md.

The black-box report format is in references/REPORT-FORMAT.md (## Black-box format).

Acting on Findings

Fix §3 violations first

Constitution violations are hard blocks — they cap the score at 3 regardless of everything else. Open soul/behavior-guide.md and remove any capability declarations that violate §3 Safety.

Fix issues before suggestions

Issues (✗) indicate missing required elements or broken configurations. Suggestions (→) are optional enhancements. Prioritize issues in low-scoring dimensions.

Apply fixes via refine

For Soul-layer fixes (background depth, speaking style, boundaries):

npx openpersona refine <slug> --emit    # request refinement via Signal Protocol
# (host LLM generates improvements)
npx openpersona refine <slug> --apply   # apply approved refinement

For structural fixes (missing faculty, missing minTrustLevel): Edit persona.json directly and regenerate:

npx openpersona update <slug>           # regenerate from updated persona.json

After applying any fix, re-run npx openpersona evaluate <slug> (see Quick Start) to verify the score improved and Constitution gate passes.

CI Integration

# .github/workflows/persona-quality.yml
- name: Evaluate persona quality
  run: |
    npx openpersona evaluate ${{ env.PERSONA_SLUG }} --output report.json
    SCORE=$(jq '.overallScore' report.json)
    if [ "$SCORE" -lt 6 ]; then
      echo "Persona quality score $SCORE < 6 — review required"
      exit 1
    fi

Relationship to Other Skills

| Skill | Relationship | | --------------------- | --------------------------------------------------------------------------- | | open-persona | Creates personas that persona-evaluator audits — the production/QA pair | | anyone-skill | Distills personas that can be evaluated with this skill after generation | | open-persona refine | The fix path after persona-evaluator identifies Soul-layer improvements |

Install

persona-evaluator ships bundled with the OpenPersona framework and is available immediately after installing it:

npm install -g openpersona
# persona-evaluator is included — no separate install needed
npx openpersona evaluate <slug>

A standalone distributable is also available at acnlabs/persona-evaluator on GitHub and listed on openpersona.co/skill/persona-evaluator.

Versioning

Current version: 0.3.4 (also in frontmatter metadata.version).

See CHANGELOG.md for full version history, rationale, test surface, and re-validation evidence. The deeper rubric review trail lives in docs/SKILL-RUBRIC.md and docs/SKILL-RUBRIC-SESSION-2.md in the main OpenPersona repo.

persona-evaluator — Persona Quality Auditor

Quick Start

# Evaluate an installed persona (static / structural)
npx openpersona evaluate <slug>

# JSON output (for scripting or CI)
npx openpersona evaluate <slug> --json

# Save report to file (always JSON; --json not needed alongside --output)
npx openpersona evaluate <slug> --output report.json

# Embed evaluable persona content (Soul/character/behavior-guide) so an
# LLM evaluator (this skill, acting through an agent) can also judge
# quality semantically — not just structurally
npx openpersona evaluate <slug> --pack-content

Choosing a mode

persona-evaluator runs in three complementary modes. Pick the mode based on what the user asks before reading the rest of this file.

Sections below cover each mode in depth: structural (What Gets Scored), semantic white-box (Semantic Evaluation), and semantic black-box (Black-box Semantic Evaluation).

What Gets Scored

The structural CLI scores 9 dimensions + the Constitution gate. Severity (strict / normal / lenient) is set per dimension by the persona's declared role.

Each dimension produces a 0–10 score, a list of issues (✗), and suggestions (→). The overall score is a severity-weighted average — see Role-aware scoring.

Role-aware scoring

Built-in role profiles

If soul/identity.role is missing or unrecognised, the evaluator falls back to assistant.

Reading the Report

Each dimension shows:

✓  identity                        9/10  (strict)
✗  character.boundaries            4/10  (strict)
   ✗ no hard limits declared in `boundaries`
   → add at least one enforceable rule (cite §3 Safety)

✓ / ✗ — pass / fail at this dimension's severity threshold.
(strict | normal | lenient) — severity from the role profile.
✗ ... — required issue that must be fixed to pass.
→ ... — optional suggestion (does not block scoring).

The summary footer prints overall score, Constitution status, and a sorted list of dimensions by severity.

Score bands

A Constitution §3 violation caps the overall score at 3 regardless of other dimensions.

Semantic Evaluation (LLM-driven)

Self-evaluation: the host persona reviews its own pack.
Peer-evaluation: an installed evaluator reviews a different persona pack the user supplies.

When to invoke semantic mode

Procedure

Run the structural CLI with --pack-content:

 npx openpersona evaluate <slug> --pack-content

Stop and report immediately if constitution.passed === false. Do not produce semantic scores when §3 has failed; the structural blockers must be fixed first.
Read report.packContent from the JSON. It includes (where defined): character.{background,personality,speakingStyle,boundaries}, immutableTraits, aesthetic.{emoji,creature,vibe}, and a whitelisted soulDocs map keyed by filename (behavior-guide.md, self-narrative.md, identity.md — only those that exist).
Score each present field 0–10 using the rubrics in references/RUBRICS.md. Use the per-dimension severity already attached to each dimension by the structural evaluator (strict / normal / lenient) to gate which checks count.
Emit the report in the white-box format defined in references/REPORT-FORMAT.md (## White-box format). Keep it under ~500 words.

Mode: self-evaluation

You are evaluating your own pack. The user has invited you (the host persona) to review yourself.

What this is for: Catch own blind spots and surface concrete polish targets.
Your bias: Self-flattery and minimisation. You will instinctively justify why your background is "deep enough" or your boundaries are "implied".
Counter-bias instruction: For every per-field score, before deciding the number, write one sentence answering: "If I weren't me, what specifically would I down-score about this field?" Then score.
Acceptable output tone: First person ("My speakingStyle…"), candid about gaps. Avoid "I think this is great." Avoid generic praise.

Mode: peer-evaluation

You are evaluating a different persona. The user has invited you (Reviewer-X) to look at Subject-Y.

What this is for: Bring an outside perspective. Self-eval can't see what's missing; peer-eval can.
Your bias: Standards-projection. If you are a strict-Skill assistant, you will instinctively want Subject-Y to also be Skill-rigorous, even if Subject-Y is a companion.
Counter-bias instruction: Score Subject-Y against its declared role, not yours. Re-read the role and weights block before each rubric. Lower expectations for lenient dimensions even if you personally find them important.
Acceptable output tone: Third person ("Subject's background…"). State your own role at the top so the reader can adjust for any leak-through.
Disclose disagreements with the role itself: If you genuinely think the declared role is wrong (e.g. labelled companion but reads like assistant), say so as a separate cross-cutting observation — don't silently re-score against your preferred role.

Black-box Semantic Evaluation

Three data-source tiers, in descending fidelity:

Full mechanics — handshake schema, probe table, identity-coherence dimension, confidence-cap justification, and hard rules — live in references/BLACK-BOX.md.

The black-box report format is in references/REPORT-FORMAT.md (## Black-box format).

Acting on Findings

Fix §3 violations first

Constitution violations are hard blocks — they cap the score at 3 regardless of everything else. Open soul/behavior-guide.md and remove any capability declarations that violate §3 Safety.

Fix issues before suggestions

Issues (✗) indicate missing required elements or broken configurations. Suggestions (→) are optional enhancements. Prioritize issues in low-scoring dimensions.

Apply fixes via refine

For Soul-layer fixes (background depth, speaking style, boundaries):

npx openpersona refine <slug> --emit    # request refinement via Signal Protocol
# (host LLM generates improvements)
npx openpersona refine <slug> --apply   # apply approved refinement

For structural fixes (missing faculty, missing minTrustLevel): Edit persona.json directly and regenerate:

npx openpersona update <slug>           # regenerate from updated persona.json

After applying any fix, re-run npx openpersona evaluate <slug> (see Quick Start) to verify the score improved and Constitution gate passes.

CI Integration

# .github/workflows/persona-quality.yml
- name: Evaluate persona quality
  run: |
    npx openpersona evaluate ${{ env.PERSONA_SLUG }} --output report.json
    SCORE=$(jq '.overallScore' report.json)
    if [ "$SCORE" -lt 6 ]; then
      echo "Persona quality score $SCORE < 6 — review required"
      exit 1
    fi

Relationship to Other Skills

Install

persona-evaluator ships bundled with the OpenPersona framework and is available immediately after installing it:

npm install -g openpersona
# persona-evaluator is included — no separate install needed
npx openpersona evaluate <slug>

A standalone distributable is also available at acnlabs/persona-evaluator on GitHub and listed on openpersona.co/skill/persona-evaluator.

Versioning

Current version: 0.3.4 (also in frontmatter metadata.version).

Adoption

acnlabs/persona-evaluator

$ install --global

Security Scan Results

SKILL.md

persona-evaluator — Persona Quality Auditor

Quick Start

Choosing a mode

What Gets Scored

Role-aware scoring

Built-in role profiles

Reading the Report

Score bands

Semantic Evaluation (LLM-driven)

When to invoke semantic mode

Procedure

Mode: self-evaluation

Mode: peer-evaluation

Black-box Semantic Evaluation

Acting on Findings

Fix §3 violations first

Fix issues before suggestions

Apply fixes via refine

CI Integration

Relationship to Other Skills

Install

Versioning

Related Skills

acnlabs/brand-persona-skill

acnlabs/persona-secondme-skill

acnlabs/secondme-skill

acnlabs/entrepreneur-skill

acnlabs/persona-evaluator

$ install --global

Security Scan Results

SKILL.md

persona-evaluator — Persona Quality Auditor

Quick Start

Choosing a mode

What Gets Scored

Role-aware scoring

Built-in role profiles

Reading the Report

Score bands

Semantic Evaluation (LLM-driven)

When to invoke semantic mode

Procedure

Mode: self-evaluation

Mode: peer-evaluation

Black-box Semantic Evaluation

Acting on Findings

Fix §3 violations first

Fix issues before suggestions

Apply fixes via refine

CI Integration

Relationship to Other Skills

Install

Versioning

Related Skills

acnlabs/brand-persona-skill

acnlabs/persona-secondme-skill

acnlabs/secondme-skill

acnlabs/entrepreneur-skill