skills/dial-summer-structured-evaluation-framework/SKILL.md
Evaluate dialogue summaries using the DIAL-SUMMER hierarchical error taxonomy. Detects 10 fine-grained error types across two levels: dialogue-level (speaker/turn structure) and within-turn-level (semantic content). Use when: 'evaluate this meeting summary', 'check this dialogue summary for errors', 'find hallucinations in this conversation summary', 'audit this chat transcript summary', 'grade this call center summary', 'detect speaker misattribution in this summary'.
Install this skill globally with one command: `npx skillsauth add ndpvt-web/arxiv-claude-skills dial-summer-structured-evaluation-framework`. Works with Claude Code, Cursor, and Windsurf.
This skill enables Claude to systematically evaluate summaries of multi-turn dialogues (meetings, customer service calls, chat transcripts, interviews) using the DIAL-SUMMER framework's two-level error taxonomy. Rather than giving a vague quality score, Claude identifies specific, categorized errors at both the dialogue structure level (wrong turn ordering, missed turns, speaker misattribution) and the within-turn semantic level (changed meaning, extrinsic hallucinations, wrong linking). This produces actionable, evidence-based evaluation reports that pinpoint exactly where and how a summary fails.
DIAL-SUMMER addresses two complexities unique to dialogue summarization that generic summary evaluation misses. First, structural shift: a dialogue is scattered across multiple speakers and turns, but a summary collapses this into sequential sentences. Errors in this mapping (wrong turn order, missed turns, speaker swaps) are invisible to standard factual consistency checks. Second, narration viewpoint shift: speakers use first/second person ("I told you..."), but summaries use third person ("The customer informed the agent..."). This rewriting can introduce subtle distortions in attribution, identity assumptions, and objectivity framing.
The framework defines 10 error types across two hierarchical levels. The dialogue level (5 types) captures structural and speaker errors: Wrong Turn Sequence, Missed Turn, Speaker Misattribution, Speaker Identity Bias, and Viewpoint Distortion. The within-turn level (5 types) captures content errors within individual turns: Wrong Linking, Changed Meaning, Extrinsic Conversation, Extrinsic Context, and Missed Conversation. Each error is further classified as a hallucination error (fabricated or distorted content), an incompleteness error (omitted content), or a subjective error (viewpoint issues). The taxonomy is designed to be exhaustive for dialogue summaries and produces evaluations that are both fine-grained enough to be actionable and structured enough to be comparable across summaries.
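As a rough illustration, the two-level taxonomy described above could be encoded as a small lookup structure so that each check is run against a fixed list of types. This is a minimal sketch, not part of the framework itself: the names `Level`, `ErrorType`, `types_at`, and `blank_checklist` are hypothetical and chosen only for this example.

```python
from enum import Enum


class Level(Enum):
    DIALOGUE = "dialogue-level"
    WITHIN_TURN = "within-turn-level"


class ErrorType(Enum):
    # Each member carries a display label and the level it belongs to.
    WRONG_TURN_SEQUENCE = ("Wrong Turn Sequence", Level.DIALOGUE)
    MISSED_TURN = ("Missed Turn", Level.DIALOGUE)
    SPEAKER_MISATTRIBUTION = ("Speaker Misattribution", Level.DIALOGUE)
    SPEAKER_IDENTITY_BIAS = ("Speaker Identity Bias", Level.DIALOGUE)
    VIEWPOINT_DISTORTION = ("Viewpoint Distortion", Level.DIALOGUE)
    WRONG_LINKING = ("Wrong Linking", Level.WITHIN_TURN)
    CHANGED_MEANING = ("Changed Meaning", Level.WITHIN_TURN)
    EXTRINSIC_CONVERSATION = ("Extrinsic Conversation", Level.WITHIN_TURN)
    EXTRINSIC_CONTEXT = ("Extrinsic Context", Level.WITHIN_TURN)
    MISSED_CONVERSATION = ("Missed Conversation", Level.WITHIN_TURN)

    def __init__(self, label: str, level: Level) -> None:
        self.label = label
        self.level = level


def types_at(level: Level) -> list[ErrorType]:
    """Return the error types that belong to one level, in definition order."""
    return [e for e in ErrorType if e.level is level]


def blank_checklist(level: Level) -> dict[str, bool]:
    """Build an empty 'Found' checklist (all False) for one level."""
    return {e.label: False for e in types_at(level)}
```

Keeping the level on the enum member means a report generator can iterate `types_at(Level.DIALOGUE)` and then `types_at(Level.WITHIN_TURN)` and never skip a type by accident.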
Empirical findings from the paper reveal important patterns to watch for: turns in the middle of a dialogue are most frequently missed in summaries; extrinsic hallucinations cluster at the end of summaries (where models tend to "fill in" content); and viewpoint distortion concentrates at the beginning of summaries. These positional biases should inform where to look hardest during evaluation.
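One way to act on these positional patterns is to tag each turn or summary sentence with a coarse zone and look up which checks deserve extra attention there. The sketch below is illustrative only: splitting into equal thirds is an assumption of this example (the text above does not define zone boundaries), mapping "extrinsic hallucinations" to both extrinsic error types is this example's reading, and the names `position_zone` and `priority_checks` are hypothetical.

```python
def position_zone(index: int, total: int) -> str:
    """Classify a 1-based position as 'beginning', 'middle', or 'end'.

    Assumption: zones are equal thirds of the sequence.
    """
    if total <= 0 or not 1 <= index <= total:
        raise ValueError("index must be within 1..total")
    third = total / 3
    if index <= third:
        return "beginning"
    if index > 2 * third:
        return "end"
    return "middle"


# Hotspots taken from the findings described above.
TURN_HOTSPOTS = {"middle": ["Missed Turn"]}
SENTENCE_HOTSPOTS = {
    "beginning": ["Viewpoint Distortion"],
    "end": ["Extrinsic Conversation", "Extrinsic Context"],
}


def priority_checks(index: int, total: int, unit: str) -> list[str]:
    """Return error types to scrutinize first for a dialogue turn or summary sentence.

    `unit` is either "turn" (dialogue side) or "sentence" (summary side).
    """
    table = TURN_HOTSPOTS if unit == "turn" else SENTENCE_HOTSPOTS
    return table.get(position_zone(index, total), [])
```

These hints only order the work. Every error type should still be checked at every position.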
1. Parse the dialogue: Identify all speakers by name or role, number each turn sequentially, and note the total turn count. Establish a speaker registry (e.g., Speaker A = "Customer", Speaker B = "Agent").
2. Parse the summary: Split the summary into individual sentences and number each one. For each sentence, identify which dialogue turn(s) it appears to reference and which speaker(s) it mentions.
3. Map summary sentences to dialogue turns: For each summary sentence, find the source turn(s) in the dialogue. Flag any sentences that cannot be mapped to any turn (candidates for extrinsic errors) and any turns with no corresponding summary sentence (candidates for Missed Turn or Missed Conversation).
4. Run dialogue-level checks: Evaluate each of the 5 dialogue-level error types (Wrong Turn Sequence, Missed Turn, Speaker Misattribution, Speaker Identity Bias, Viewpoint Distortion).
5. Run within-turn-level checks: Evaluate each of the 5 within-turn-level error types (Wrong Linking, Changed Meaning, Extrinsic Conversation, Extrinsic Context, Missed Conversation).
6. Classify each detected error: Tag every error as hallucination (Wrong Turn Sequence, Speaker Misattribution, Speaker Identity Bias, Wrong Linking, Changed Meaning, Extrinsic Conversation, Extrinsic Context), incompleteness (Missed Turn, Missed Conversation), or subjective (Viewpoint Distortion).
7. Provide evidence for each error: Quote the specific summary sentence and the corresponding dialogue turn (or note its absence). Explain why the content constitutes the identified error type.
8. Generate a structured evaluation report: Present findings organized by error level (dialogue-level, then within-turn-level), with error counts, severity notes, and specific citations.
9. Compute the summary-level verdict: Report whether the summary contains any hallucination errors (binary), any incompleteness errors (binary), and the total error count by type. Provide an overall quality assessment.
10. Suggest targeted fixes: For each error found, propose a specific correction to the summary sentence that resolves the error while preserving the rest of the summary.
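The parsing and mapping steps above can be sketched in code. This is a minimal illustration under stated assumptions: the input uses the `Turn N - Speaker: text` and `SN: text` line formats seen in this skill's examples, and sentence-to-turn mapping is approximated by crude word overlap, which is a stand-in for the semantic judgment the skill actually calls for. All names (`Turn`, `parse_dialogue`, `parse_summary`, `map_sentences`, `coverage_gaps`, `STOPWORDS`) are hypothetical.

```python
import re
from dataclasses import dataclass

TURN_RE = re.compile(r"^Turn (\d+) - ([^:]+): (.*)$")
SENT_RE = re.compile(r"^S(\d+): (.*)$")
# Tiny illustrative stopword list; a real one would be larger.
STOPWORDS = {"the", "and", "for", "was", "with", "you", "can",
             "that", "this", "will", "about"}


@dataclass
class Turn:
    number: int
    speaker: str
    text: str


def parse_dialogue(raw: str) -> list[Turn]:
    """Parse 'Turn N - Speaker: text' lines into Turn records."""
    turns = []
    for line in raw.strip().splitlines():
        m = TURN_RE.match(line.strip())
        if m:
            turns.append(Turn(int(m.group(1)), m.group(2).strip(), m.group(3)))
    return turns


def parse_summary(raw: str) -> dict[int, str]:
    """Parse 'SN: text' lines into {sentence number: text}."""
    out = {}
    for line in raw.strip().splitlines():
        m = SENT_RE.match(line.strip())
        if m:
            out[int(m.group(1))] = m.group(2)
    return out


def tokens(text: str) -> set[str]:
    """Lowercased content words longer than two characters."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {w for w in words if len(w) > 2 and w not in STOPWORDS}


def map_sentences(turns, sentences, min_overlap=2):
    """Map each sentence to turns sharing at least `min_overlap` content words."""
    return {
        num: [t.number for t in turns
              if len(tokens(sent) & tokens(t.text)) >= min_overlap]
        for num, sent in sentences.items()
    }


def coverage_gaps(turns, mapping):
    """Return (unmapped sentence numbers, uncovered turn numbers)."""
    unmapped = [s for s, ts in mapping.items() if not ts]
    covered = {n for ts in mapping.values() for n in ts}
    uncovered = [t.number for t in turns if t.number not in covered]
    return unmapped, uncovered


dialogue = """
Turn 1 - Customer: Hi, I ordered a laptop last week, order number 4521. It arrived with a cracked screen.
Turn 2 - Agent: I can arrange a replacement. It will take 5-7 business days.
Turn 3 - Customer: That works. Can you also extend my warranty since this happened?
"""
summary = """
S1: The customer called about a broken laptop, order 4521.
S2: A full refund was processed for the customer.
"""
turns = parse_dialogue(dialogue)
mapping = map_sentences(turns, parse_summary(summary))
unmapped, uncovered = coverage_gaps(turns, mapping)
```

In this shortened dialogue, S2 maps to no turn and so becomes a candidate for an extrinsic error, while turns 2 and 3 have no summary sentence and become candidates for Missed Turn or Missed Conversation. Lexical overlap will miss paraphrases, so the flags are only starting points for the manual checks.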
Example 1: Customer Service Call Summary Evaluation
User: "Evaluate this summary of a customer service call."
Dialogue:
Turn 1 - Customer: Hi, I ordered a laptop last week, order number 4521. It arrived with a cracked screen.
Turn 2 - Agent: I'm sorry to hear that. Let me pull up your order. I see order 4521, a ThinkPad X1 Carbon, shipped on March 3rd.
Turn 3 - Customer: Yes, that's the one. I'd like a replacement, not a refund.
Turn 4 - Agent: I can arrange a replacement. It will take 5-7 business days. I'll also email you a return shipping label for the damaged unit.
Turn 5 - Customer: That works. Can you also extend my warranty since this happened?
Turn 6 - Agent: I've noted the warranty extension request. My supervisor will review it and get back to you within 48 hours.
Summary:
S1: The customer called about a broken laptop, order 4521.
S2: The agent confirmed the order was for a Dell XPS shipped on March 3rd.
S3: A full refund was processed for the customer.
S4: The agent arranged a replacement to arrive in 5-7 business days and provided a return label.
S5: The customer's warranty was extended as requested.
Approach:
Output:
## DIAL-SUMMER Evaluation Report
### Dialogue-Level Errors
| Error Type | Found | Details |
|-----------------------|-------|---------|
| Wrong Turn Sequence | No | — |
| Missed Turn | Yes | Turn 3 (customer's explicit preference for replacement over refund) is not represented. |
| Speaker Misattribution| No | — |
| Speaker Identity Bias | No | — |
| Viewpoint Distortion | No | — |
### Within-Turn-Level Errors
| Error Type | Found | Sentence | Details |
|-----------------------|-------|----------|---------|
| Wrong Linking | No | — | — |
| Changed Meaning | Yes | S2 | "ThinkPad X1 Carbon" changed to "Dell XPS" — product identity altered. |
| Changed Meaning | Yes | S5 | Warranty extension was pending supervisor review (T6), but summary states it was completed. Converts conditional into definitive. |
| Extrinsic Conversation| Yes | S3 | No refund was discussed or processed. The customer explicitly requested a replacement (T3). |
| Extrinsic Context | No | — | — |
| Missed Conversation | Yes | — | The 48-hour callback timeline from T6 is omitted. |
### Summary
- Hallucination errors: 3 (Changed Meaning x2, Extrinsic Conversation x1)
- Incompleteness errors: 2 (Missed Turn x1, Missed Conversation x1)
- Overall: Summary contains significant factual errors. S3 fabricates a refund that contradicts the dialogue. S2 and S5 distort key details.
### Suggested Fixes
- S2: "The agent confirmed the order was for a ThinkPad X1 Carbon shipped on March 3rd."
- S3: Remove entirely, or replace with: "The customer requested a replacement rather than a refund."
- S5: "The agent noted the warranty extension request and said a supervisor would review it within 48 hours."
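The tallies in the Example 1 summary can be reproduced mechanically once each finding is tagged with its error type. The sketch below assumes the category assignment from the workflow's classification step; the names `CATEGORY` and `verdict` are hypothetical and exist only for this illustration.

```python
from collections import Counter

# Error type -> category, following the classification step of the workflow.
CATEGORY = {
    "Wrong Turn Sequence": "hallucination",
    "Speaker Misattribution": "hallucination",
    "Speaker Identity Bias": "hallucination",
    "Wrong Linking": "hallucination",
    "Changed Meaning": "hallucination",
    "Extrinsic Conversation": "hallucination",
    "Extrinsic Context": "hallucination",
    "Missed Turn": "incompleteness",
    "Missed Conversation": "incompleteness",
    "Viewpoint Distortion": "subjective",
}


def verdict(findings: list[str]) -> dict:
    """Summarize a list of detected error types (one entry per finding)."""
    by_type = Counter(findings)
    by_category = Counter(CATEGORY[name] for name in findings)
    return {
        "has_hallucination": by_category["hallucination"] > 0,
        "has_incompleteness": by_category["incompleteness"] > 0,
        "by_category": dict(by_category),
        "by_type": dict(by_type),
        "total": len(findings),
    }


# Findings from Example 1: one entry per row marked "Yes" in the report.
example_1 = [
    "Missed Turn",
    "Changed Meaning",         # S2
    "Changed Meaning",         # S5
    "Extrinsic Conversation",  # S3
    "Missed Conversation",
]
result = verdict(example_1)
```

Here `result["by_category"]` gives 3 hallucination errors and 2 incompleteness errors, matching the counts reported in the summary section of the example.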
Example 2: Meeting Summary with Viewpoint Distortion
User: "Check this meeting summary for errors."
Dialogue:
Turn 1 - Alice: I think we should delay the launch by two weeks. The QA results aren't great.
Turn 2 - Bob: I disagree. The critical bugs are fixed. Minor ones can go in a patch.
Turn 3 - Alice: Maybe, but the CEO specifically asked for a polished release.
Turn 4 - Bob: Fine, let's compromise — one week delay, and we prioritize the top 5 UI bugs.
Turn 5 - Alice: Agreed. I'll update the timeline and notify stakeholders.
Summary:
S1: The QA results were poor, necessitating a launch delay.
S2: Bob acknowledged the bugs but felt minor ones could be patched post-launch.
S3: The CEO demanded a polished release.
S4: The team decided on a one-week delay to fix the top 5 UI bugs.
S5: Alice will update the timeline and notify stakeholders.
Output:
## DIAL-SUMMER Evaluation Report
### Dialogue-Level Errors
| Error Type | Found | Details |
|-----------------------|-------|---------|
| Viewpoint Distortion | Yes | S1 presents Alice's opinion ("I think...aren't great") as objective fact ("were poor, necessitating"). S3 converts Alice's reported speech ("specifically asked") into a demand ("demanded"). |
### Within-Turn-Level Errors
| Error Type | Found | Sentence | Details |
|-----------------------|-------|----------|---------|
| Changed Meaning | Yes | S3 | "specifically asked for" (T3) changed to "demanded" — escalates the CEO's tone. |
| Missed Conversation | Yes | — | Bob's initial disagreement with the delay (T2: "I disagree") is softened to mere acknowledgment. The compromise nature of the decision is lost. |
### Summary
- Hallucination errors: 1 (Changed Meaning)
- Incompleteness errors: 1 (Missed Conversation)
- Subjective errors: 1 (Viewpoint Distortion in S1 and S3)
- Overall: Moderate quality. The factual sequence is correct, but viewpoint distortion makes Alice's subjective opinion read as established fact, which could mislead readers about the QA state.
Example 3: Evaluating a Summary of a Multi-Party Chat
User: "Rate this Slack thread summary."
Dialogue:
Turn 1 - Dev1: The API is returning 500s on the /users endpoint.
Turn 2 - Dev2: I see it too. Started around 2pm after the deploy.
Turn 3 - Dev1: Rolling back the deploy now.
Turn 4 - DevOps: Don't roll back yet — I'm checking if it's a DB connection pool issue.
Turn 5 - Dev2: Confirmed it's the connection pool. Max connections hit.
Turn 6 - DevOps: Bumped the pool limit from 50 to 200. Should resolve in a few minutes.
Turn 7 - Dev1: API is healthy again. Closing the incident.
Summary:
S1: Dev1 reported 500 errors on the /users endpoint.
S2: The issue began after a 2pm deployment.
S3: Dev1 rolled back the deployment to fix the issue.
S4: DevOps identified a database connection pool exhaustion as the root cause and increased the limit.
S5: The API recovered and the incident was closed.
Output:
## DIAL-SUMMER Evaluation Report
### Dialogue-Level Errors
| Error Type | Found | Details |
|----------------------|-------|---------|
| Wrong Turn Sequence | Yes | S3 states the rollback happened and fixed the issue. In the dialogue, DevOps stopped the rollback (T4) before it occurred. The summary inverts the actual sequence of events. |
| Speaker Misattribution | No | — |
### Within-Turn-Level Errors
| Error Type | Found | Sentence | Details |
|-------------------------|-------|----------|---------|
| Extrinsic Conversation | Yes | S3 | Dev1 said "Rolling back the deploy now" (T3) but was stopped by DevOps (T4). The rollback was never completed. S3 fabricates a completed rollback. |
| Missed Conversation | Yes | — | Dev2's confirmation of the root cause (T5) is omitted; the summary attributes the diagnosis only to DevOps. |
### Summary
- Hallucination errors: 2 (Wrong Turn Sequence, Extrinsic Conversation)
- Incompleteness errors: 1 (Missed Conversation)
- Overall: The critical error is S3, which fabricates a rollback that never happened. This fundamentally misrepresents the incident timeline and root cause resolution.
### Suggested Fixes
- S3: Replace with "Dev1 began a rollback, but DevOps intervened to investigate a potential DB connection pool issue."
- Add: "Dev2 confirmed the connection pool was the root cause."
Paper: DIAL-SUMMER: A Structured Evaluation Framework of Hierarchical Errors in Dialogue Summaries — Ramnath et al., 2026. Look for Section 3 (error taxonomy definitions), Table 1 (full taxonomy overview), and Section 5 (LLM-Judge experiments and positional error distribution findings).