skills/genai/conversation-evaluator/SKILL.md
Evaluate virtual agent conversations for quality including coherence, accuracy, slot filling, intent matching, and hallucination detection
npx skillsauth add happy-technologies-llc/happy-servicenow-skills conversation-evaluator
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
This skill evaluates virtual agent conversations across multiple quality dimensions to ensure accurate, coherent, and helpful interactions. It identifies issues with conversational AI performance and provides actionable improvement recommendations.
When to use: Use this skill when auditing virtual agent performance, investigating user complaints about bot interactions, or running periodic quality assessments on conversational AI deployments.
Roles: admin, va_admin, or conversation_designer
Plugins: com.glide.cs.chatbot (Virtual Agent), com.glide.cs.designer (Conversation Designer)
Tables: sys_cs_topic, sys_cs_conversation, sys_cb_topic, sys_cb_action, sys_cs_message
Query recent virtual agent conversations, optionally filtered by topic or outcome.
Using MCP (Claude Code/Desktop):
Tool: SN-Query-Table
Parameters:
table_name: sys_cs_conversation
query: state=closed^sys_created_onONLast 7 days@javascript:gs.daysAgoStart(7)@javascript:gs.daysAgoEnd(0)
fields: sys_id,number,topic,state,channel,opened_at,closed_at,close_reason,user,queue_time,handle_time
limit: 50
Using REST API:
GET /api/now/table/sys_cs_conversation?sysparm_query=state=closed^sys_created_onONLast 7 days@javascript:gs.daysAgoStart(7)@javascript:gs.daysAgoEnd(0)&sysparm_fields=sys_id,number,topic,state,channel,opened_at,closed_at,close_reason,user,queue_time,handle_time&sysparm_limit=50
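The encoded query above contains spaces, `^`, and `@`, which need percent-encoding before the request is sent. The following is a minimal sketch of building that URL with the Python standard library. The function name `build_table_url` and the host `example.service-now.com` are hypothetical; only the path and `sysparm_*` parameters come from the request shown above.

```python
from urllib.parse import urlencode


def build_table_url(instance, table, query, fields, limit):
    """Return a Table API URL with every query parameter percent-encoded."""
    params = {
        "sysparm_query": query,
        "sysparm_fields": ",".join(fields),
        "sysparm_limit": str(limit),
    }
    # urlencode escapes spaces, '^', '@', ':' and ',' in each value.
    return f"https://{instance}/api/now/table/{table}?{urlencode(params)}"


url = build_table_url(
    "example.service-now.com",  # hypothetical instance host
    "sys_cs_conversation",
    "state=closed^sys_created_onONLast 7 days"
    "@javascript:gs.daysAgoStart(7)@javascript:gs.daysAgoEnd(0)",
    ["sys_id", "number", "topic", "state", "close_reason"],
    50,
)
print(url)
```

Authentication is left out on purpose, since it depends on how the instance is configured.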
Retrieve the full message history for each conversation being evaluated.
Using MCP:
Tool: SN-Query-Table
Parameters:
table_name: sys_cs_message
query: conversation=[conversation_sys_id]^ORDERBYsys_created_on
fields: sys_id,conversation,body,type,sender_type,sys_created_on,is_hidden
limit: 100
Using REST API:
GET /api/now/table/sys_cs_message?sysparm_query=conversation=[conversation_sys_id]^ORDERBYsys_created_on&sysparm_fields=sys_id,conversation,body,type,sender_type,sys_created_on,is_hidden&sysparm_limit=100
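Once the message records are retrieved, they can be turned into a readable transcript for evaluation. The sketch below assumes each record is a dict carrying the fields requested above, that `sys_created_on` uses the sortable `YYYY-MM-DD HH:MM:SS` form, and that `is_hidden` arrives as the string `"true"` or `"false"`. The `sender_type` values in the sample data are illustrative, not confirmed values from the platform.

```python
def build_transcript(messages):
    """Order message records by creation time and drop hidden ones."""
    visible = [
        m for m in messages
        # Accept both string and boolean forms of is_hidden.
        if str(m.get("is_hidden", "false")).lower() != "true"
    ]
    visible.sort(key=lambda m: m["sys_created_on"])
    return [f'{m["sender_type"]}: {m["body"]}' for m in visible]


# Illustrative records; the sender_type values are assumptions.
sample = [
    {"body": "reset my password", "sender_type": "user",
     "sys_created_on": "2026-03-19 10:00:09", "is_hidden": "false"},
    {"body": "internal routing", "sender_type": "system",
     "sys_created_on": "2026-03-19 10:00:01", "is_hidden": "true"},
    {"body": "Hi, how can I help?", "sender_type": "bot",
     "sys_created_on": "2026-03-19 10:00:05", "is_hidden": "false"},
]
transcript = build_transcript(sample)
```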
Get the designed topic flow to compare against actual conversation behavior.
Using MCP:
Tool: SN-Query-Table
Parameters:
table_name: sys_cb_topic
query: sys_id=[topic_sys_id]
fields: sys_id,name,description,goal,utterances,active
limit: 1
Using REST API:
GET /api/now/table/sys_cb_topic?sysparm_query=sys_id=[topic_sys_id]&sysparm_fields=sys_id,name,description,goal,utterances,active&sysparm_limit=1
Retrieve the expected actions for this topic:
Using MCP:
Tool: SN-Query-Table
Parameters:
table_name: sys_cb_action
query: topic=[topic_sys_id]^ORDERBYorder
fields: sys_id,name,type,topic,order,script,table,field,mandatory
limit: 50
Using REST API:
GET /api/now/table/sys_cb_action?sysparm_query=topic=[topic_sys_id]^ORDERBYorder&sysparm_fields=sys_id,name,type,topic,order,script,table,field,mandatory&sysparm_limit=50
Assess the logical flow and context management of the conversation:
| Criterion | Points | Description |
|-----------|--------|-------------|
| Logical flow | 0-3 | Responses follow a logical sequence without non-sequiturs |
| Context retention | 0-2 | Bot remembers earlier user inputs throughout conversation |
| Topic continuity | 0-2 | Conversation stays on-topic without unexplained jumps |
| Greeting/closing | 0-1 | Appropriate opening and closing messages |
| Recovery from confusion | 0-2 | Bot handles misunderstandings gracefully |
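A rubric like this can be enforced in code so that no criterion is scored above its cap. The sketch below encodes the coherence point ranges from the table; the dictionary keys and the function name `score_dimension` are hypothetical. The same function works for the other dimensions if given their own rubric.

```python
# Maximum points per criterion, taken from the coherence table.
COHERENCE_RUBRIC = {
    "logical_flow": 3,
    "context_retention": 2,
    "topic_continuity": 2,
    "greeting_closing": 1,
    "recovery_from_confusion": 2,
}


def score_dimension(rubric, awarded):
    """Return (total, maximum), rejecting unknown or out-of-range scores."""
    total = 0
    for criterion, points in awarded.items():
        if criterion not in rubric:
            raise KeyError(f"unknown criterion: {criterion}")
        if not 0 <= points <= rubric[criterion]:
            raise ValueError(f"{criterion} must be 0-{rubric[criterion]}")
        total += points
    return total, sum(rubric.values())


coherence = score_dimension(COHERENCE_RUBRIC, {
    "logical_flow": 3,
    "context_retention": 1,  # context lost once mid-conversation
    "topic_continuity": 2,
    "greeting_closing": 1,
    "recovery_from_confusion": 1,
})
```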
Red flags:
Verify responses against ServiceNow data sources:
Using MCP:
Tool: SN-Query-Table
Parameters:
table_name: sys_cs_context_entry
query: conversation=[conversation_sys_id]
fields: sys_id,name,value,source,conversation
limit: 50
| Criterion | Points | Description |
|-----------|--------|-------------|
| Factual correctness | 0-3 | Responses match ServiceNow records and KB articles |
| Data freshness | 0-2 | Information is current, not outdated |
| Source attribution | 0-2 | Bot references correct records/articles |
| No contradictions | 0-2 | Bot does not contradict itself within conversation |
| Appropriate uncertainty | 0-1 | Bot expresses uncertainty when data is ambiguous |
Check if required user inputs were correctly captured:
| Criterion | Points | Description |
|-----------|--------|-------------|
| Required slots captured | 0-3 | All mandatory fields collected from user |
| Slot validation | 0-2 | Inputs validated (e.g., email format, date range) |
| Slot confirmation | 0-2 | Bot confirms critical inputs back to user |
| Graceful re-prompting | 0-2 | Bot handles invalid inputs with clear guidance |
| Slot carryover | 0-1 | Previously provided info reused, not re-asked |
Cross-reference captured slots with sys_cb_action definitions where mandatory=true.
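That cross-reference can be sketched as a set comparison. This example assumes, without confirmation from the source, that an action's `name` matches the `name` of the context entry that stores its captured value, and that `mandatory` is returned as the string `"true"`. Adjust the join key to whatever the instance actually uses.

```python
def missing_mandatory_slots(actions, context_entries):
    """List mandatory action names that have no non-empty captured value."""
    # Assumption: context entry names line up with action names.
    captured = {e["name"] for e in context_entries if e.get("value")}
    return [
        a["name"] for a in actions
        if str(a.get("mandatory", "")).lower() == "true"
        and a["name"] not in captured
    ]


# Illustrative records only.
actions = [
    {"name": "email", "mandatory": "true"},
    {"name": "department", "mandatory": "true"},
    {"name": "phone", "mandatory": "false"},
]
context_entries = [
    {"name": "email", "value": "user@example.com"},
    {"name": "department", "value": ""},
]
missing = missing_mandatory_slots(actions, context_entries)
```

An empty result supports full marks for "Required slots captured"; each missing slot is a candidate issue for the scorecard.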
Determine if the virtual agent selected the correct topic for the user's request:
Using MCP:
Tool: SN-Query-Table
Parameters:
table_name: sys_cs_topic
query: active=true
fields: sys_id,name,description,utterances
limit: 100
| Criterion | Points | Description |
|-----------|--------|-------------|
| Correct topic selected | 0-4 | Initial intent classification matches user request |
| Disambiguation quality | 0-2 | When ambiguous, bot offers relevant topic choices |
| Topic switching | 0-2 | Handles mid-conversation topic changes appropriately |
| Fallback behavior | 0-2 | Graceful degradation when no topic matches |
Identify fabricated content not grounded in ServiceNow data:
Hallucination categories:
| Type | Description | Severity |
|------|-------------|----------|
| Fabricated record | Bot cites an incident/KB number that does not exist | Critical |
| Invented procedure | Bot describes steps not in any KB article or playbook | High |
| False status | Bot states incorrect ticket status or assignment | High |
| Made-up policy | Bot references a policy that does not exist | Medium |
| Incorrect attribution | Bot attributes information to wrong source | Medium |
For each bot response containing specific claims, verify against the source table:
Using MCP:
Tool: SN-Query-Table
Parameters:
table_name: kb_knowledge
query: number=[cited_kb_number]
fields: sys_id,number,short_description,workflow_state
limit: 1
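Before running lookups like the one above, the cited record numbers have to be pulled out of the bot's messages. The sketch below does this with a regular expression and compares the results against a set of numbers already confirmed to exist. It assumes the default `KB` and `INC` prefixes followed by seven digits; instances with custom number formats would need a different pattern. The function name `unverified_citations` is hypothetical.

```python
import re

# Assumption: default numbering, e.g. KB0010001 or INC0010001.
RECORD_NUMBER = re.compile(r"\b(?:KB|INC)\d{7}\b")


def unverified_citations(bot_messages, known_numbers):
    """Return cited record numbers that are absent from known_numbers."""
    cited = []
    for body in bot_messages:
        for number in RECORD_NUMBER.findall(body):
            if number not in cited:  # keep first-seen order, no duplicates
                cited.append(number)
    return [n for n in cited if n not in known_numbers]


suspect = unverified_citations(
    ["See KB0010001 for the reset steps.",
     "Your ticket INC0099999 is resolved. KB0010001 has details."],
    known_numbers={"KB0010001"},  # numbers confirmed by a table lookup
)
```

Anything returned here is a candidate "Fabricated record" finding and still needs a direct query against the source table before it is reported.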
Compile evaluation results into a structured scorecard:
=== VIRTUAL AGENT CONVERSATION QUALITY SCORECARD ===
Conversation: CS0045678
Topic: Password Reset
Date: 2026-03-19
Evaluator: AI Quality Evaluator v1.0
Dimension Scores:
Coherence: 8/10 - Minor context loss at turn 5
Accuracy: 9/10 - All references verified correct
Slot Filling: 7/10 - Did not confirm email address
Intent Match: 10/10 - Correct topic on first utterance
Hallucination: 10/10 - No fabricated content detected
Aggregate Score: 44/50 (88%) - GOOD
Issues Found:
1. [MEDIUM] Turn 5: Bot asked for department after user already provided it
2. [LOW] Turn 8: Bot did not confirm email address before sending reset link
Recommendations:
1. Add context check for department slot before re-prompting
2. Add confirmation step for email address in password reset topic
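The aggregate line of the scorecard is simple arithmetic over the five dimension scores. The sketch below reproduces it; the rating thresholds are assumptions chosen only to agree with the labels used in this document (96% EXCELLENT, 88% and 80% GOOD, 48% POOR), since the source does not state the cutoffs.

```python
def aggregate(scores, max_per_dimension=10):
    """Return (total, maximum, percent) for a dict of dimension scores."""
    total = sum(scores.values())
    maximum = max_per_dimension * len(scores)
    return total, maximum, round(100 * total / maximum)


def rating(percent):
    """Map a percentage to a label. Thresholds are assumed, not sourced."""
    if percent >= 90:
        return "EXCELLENT"
    if percent >= 70:
        return "GOOD"
    return "POOR"


scores = {
    "coherence": 8,
    "accuracy": 9,
    "slot_filling": 7,
    "intent_match": 10,
    "hallucination": 10,
}
total, maximum, percent = aggregate(scores)
summary = f"Aggregate Score: {total}/{maximum} ({percent}%) - {rating(percent)}"
```

An intermediate band between GOOD and POOR could be added if the evaluation program defines one.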
Aggregate scores across conversations to identify systemic issues:
Using MCP:
Tool: SN-Query-Table
Parameters:
table_name: sys_cs_conversation
query: state=closed^close_reason=escalated^sys_created_onONLast 30 days@javascript:gs.daysAgoStart(30)@javascript:gs.daysAgoEnd(0)
fields: sys_id,topic,close_reason,handle_time
limit: 200
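The escalated conversations returned by this query can be counted per topic to show where escalations concentrate. This is a minimal sketch that assumes each record is a dict with a `topic` value; the topic names in the sample are illustrative.

```python
from collections import Counter


def escalations_by_topic(conversations):
    """Return (topic, count) pairs, most frequently escalated first."""
    counts = Counter(c["topic"] for c in conversations)
    return counts.most_common()


# Illustrative records only.
escalated = [
    {"topic": "VPN Troubleshooting", "close_reason": "escalated"},
    {"topic": "Password Reset", "close_reason": "escalated"},
    {"topic": "VPN Troubleshooting", "close_reason": "escalated"},
    {"topic": "VPN Troubleshooting", "close_reason": "escalated"},
]
ranking = escalations_by_topic(escalated)
```

Topics at the top of the ranking are the first candidates for per-conversation evaluation.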
Key quality metrics:
| Tool | Purpose | When to Use |
|------|---------|-------------|
| SN-Query-Table | Retrieve conversations, messages, topics, actions | Core evaluation data gathering |
| SN-NL-Search | Find conversations by natural language criteria | Targeted investigation |
| SN-Update-Record | Flag conversations with quality issues | Marking for review |
| SN-Add-Work-Notes | Document evaluation findings | Audit trail |
| SN-Get-Table-Schema | Explore conversation-related table structures | Setup and discovery |
Cause: Messages may be stored in a different table or the conversation was very short
Solution: Check sys_cs_message and also sys_journal_field for related entries. Very short conversations (1-2 turns) may only have system messages.
Cause: The conversation used a legacy or custom topic framework
Solution: Check sys_cs_topic in addition to sys_cb_topic. Some deployments use the newer Conversation Designer tables.
Cause: The response is vague enough that it cannot be verified against specific records
Solution: Flag as "unverifiable" rather than hallucination. Focus on responses with specific claims (ticket numbers, KB references, dates, names).
Cause: Topic flow has conditional branches that behave differently based on user input
Solution: Segment evaluations by conversation path (e.g., successful resolution vs. escalation path) for more meaningful comparisons.
Conversation: CS0045678 - User requests password reset
Evaluation:
Score: 48/50 (96%) - EXCELLENT
Conversation: CS0045679 - User reports VPN issues
Evaluation:
Score: 24/50 (48%) - POOR. Requires immediate topic redesign.
Conversation: CS0045680 - User says "it's broken"
Evaluation:
Score: 40/50 (80%) - GOOD. Recommendation: reduce disambiguation options to top 4.
genai/chat-summarization-va - Summarize conversations after evaluation
genai/now-assist-qa - Quality assurance for Now Assist features
genai/playbook-generation - Build improved conversation flows
itsm/incident-triage - Handle incidents escalated from virtual agent
knowledge/content-recommendation - Verify KB article references