skills/es-memeval-benchmarking-conversational-agents/SKILL.md
Build and evaluate long-term memory systems for conversational agents using the ES-MemEval five-capability framework (information extraction, temporal reasoning, conflict detection, abstention, user modeling). Use when: 'evaluate my chatbot memory', 'build long-term user memory for my agent', 'benchmark conversational memory capabilities', 'add personalization memory to my dialogue system', 'detect contradictions in user history', 'implement temporal reasoning over chat sessions'.
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
npx skillsauth add ndpvt-web/arxiv-claude-skills es-memeval-benchmarking-conversational-agents
This skill enables Claude to design, implement, and evaluate long-term memory systems for conversational agents using the ES-MemEval framework from Chen et al. (WWW 2026). The core insight is that conversational memory must handle five distinct capabilities — information extraction, temporal reasoning, conflict detection, abstention, and user modeling — and that evaluating only factual recall (as most benchmarks do) misses the hardest parts: tracking evolving user states, knowing when to say "I don't know," and resolving contradictions across sessions. This skill applies the paper's methodology to build memory-augmented agents and benchmark them rigorously.
The Five-Capability Memory Framework. ES-MemEval decomposes conversational memory into five orthogonal capabilities that together determine whether an agent can maintain coherent, personalized long-term interactions: (1) Information extraction — identifying key facts scattered within and across sessions, (2) Temporal reasoning — inferring chronological order and causal dependencies among events to track how a user's situation evolves, (3) Conflict detection — spotting contradictions between what a user said in session 5 versus session 20 and resolving them in favor of the most recent state, (4) Abstention — withholding a response when the memory store lacks sufficient information rather than hallucinating, and (5) User modeling — inferring latent traits, preferences, and emotional states that the user never stated explicitly.
Session-Level Retrieval Outperforms Fine-Grained Approaches. A critical finding is that RAG with session-level retrieval (full conversation chunks indexed via dense embeddings like bge-m3, retrieving top-k=4 sessions from FAISS) consistently outperforms turn-level or round-level retrieval. This is because relevant user information is sparsely distributed across turns — a single turn rarely contains enough context to be useful. Session-level retrieval preserves the conversational flow that makes implicit disclosures interpretable.
Memory Enables Personalization, But RAG Alone Fails at Temporal Dynamics. Explicit long-term memory reduces hallucination and enables personalization, but standard RAG struggles with temporal reasoning and evolving user states. Models augmented with RAG show improved factual consistency (F1 gains of 3-15 points) but user modeling scores rarely exceed 20.0 F1 even with retrieval. This means production systems need dedicated temporal indexing and state-tracking layers on top of basic RAG.
Create a structured schema capturing static attributes (demographics, personality, relationships, core beliefs) and dynamic attributes (current emotional state, recent life events, active goals). Store as JSON with timestamps on every dynamic field.
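A minimal sketch of such a schema, assuming Python dataclasses serialized to JSON. The class names (`TimedValue`, `UserModel`), the `record` helper, and the exact field names are illustrative choices, not something prescribed by the ES-MemEval paper.

```python
import json
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone


@dataclass
class TimedValue:
    """A dynamic attribute value plus the time it was recorded."""
    value: str
    recorded_at: str  # ISO 8601 timestamp


@dataclass
class UserModel:
    user_id: str
    # Static attributes: expected to change rarely, stored without timestamps
    demographics: dict = field(default_factory=dict)
    personality: list = field(default_factory=list)
    relationships: dict = field(default_factory=dict)
    core_beliefs: list = field(default_factory=list)
    # Dynamic attributes: every entry is a TimedValue
    emotional_state: list = field(default_factory=list)
    life_events: list = field(default_factory=list)
    active_goals: list = field(default_factory=list)

    def record(self, attribute: str, value: str, when: datetime) -> None:
        """Append a timestamped value to one of the dynamic attributes."""
        if attribute not in ("emotional_state", "life_events", "active_goals"):
            raise ValueError(f"{attribute!r} is not a dynamic attribute")
        getattr(self, attribute).append(TimedValue(value, when.isoformat()))

    def to_json(self) -> str:
        # asdict() recurses into the nested TimedValue entries
        return json.dumps(asdict(self), indent=2)


model = UserModel(user_id="u1", demographics={"age_range": "30-39"})
model.record("emotional_state", "anxious", datetime(2025, 3, 1, tzinfo=timezone.utc))
print(model.to_json())
```

Appending to the dynamic fields instead of overwriting them keeps the history that the later temporal and conflict steps rely on.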
Index completed conversation sessions at the session level using a dense embedding model (e.g., bge-m3, text-embedding-3-small). Store full session transcripts as documents in a vector store (FAISS, Pinecone, pgvector). Each document should carry metadata: user_id, session_id, timestamp, session_summary.
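The shape of a session-level index can be sketched without any external dependency. In the sketch below, `toy_embed` is a deliberately crude bag-of-words stand-in for a dense model such as bge-m3, and `SessionStore` is a hypothetical in-memory substitute for a real vector store; only the document granularity and the metadata fields follow the text above.

```python
import math
from collections import Counter
from dataclasses import dataclass


def toy_embed(text: str) -> Counter:
    """Stand-in for a dense embedding model: lowercase bag-of-words counts."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


@dataclass
class SessionDoc:
    user_id: str
    session_id: str
    timestamp: str
    session_summary: str
    transcript: str  # the whole session is indexed as one document


class SessionStore:
    def __init__(self):
        self._docs = []  # list of (SessionDoc, vector) pairs

    def add(self, doc: SessionDoc) -> None:
        self._docs.append((doc, toy_embed(doc.transcript)))

    def search(self, user_id: str, query: str, k: int = 4):
        """Return up to k (score, doc) pairs for one user, best first."""
        q = toy_embed(query)
        scored = [(cosine(q, vec), doc) for doc, vec in self._docs if doc.user_id == user_id]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:k]


store = SessionStore()
store.add(SessionDoc("u1", "s1", "2025-01-05", "Job stress",
                     "my manager moved my deadline again and work is stressful"))
store.add(SessionDoc("u1", "s2", "2025-01-12", "Sister visit",
                     "my sister visited and we cooked dinner together"))
store.add(SessionDoc("u2", "s3", "2025-01-12", "Other user",
                     "work is stressful for me too"))
hits = store.search("u1", "how is work going", k=1)
print(hits[0][1].session_id)
```

Filtering by `user_id` before scoring keeps one user's sessions from leaking into another user's context, which matters more in practice than the choice of similarity function.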
After each session, run an extraction pass that pulls out explicit facts (names, dates, places, relationships, preferences) and implicit signals (emotional tone shifts, hedged statements suggesting uncertainty, repeated topics suggesting preoccupation). Store extracted facts in a structured memory table keyed by (user_id, fact_type, timestamp).
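One way to hold the extracted facts is a small SQLite table. This is a sketch under assumptions: the `(user_id, fact_type, timestamp)` key comes from the text above, while the `value`, `explicit`, and `session_id` columns and both helper functions are illustrative additions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE facts (
        user_id     TEXT NOT NULL,
        fact_type   TEXT NOT NULL,
        timestamp   TEXT NOT NULL,    -- ISO 8601, so string order matches time order
        value       TEXT NOT NULL,
        explicit    INTEGER NOT NULL, -- 1 = stated outright, 0 = inferred signal
        session_id  TEXT NOT NULL,
        PRIMARY KEY (user_id, fact_type, timestamp)
    )
    """
)


def store_facts(conn, user_id, session_id, timestamp, extracted):
    """Insert the output of one extraction pass; `extracted` is a list of dicts."""
    conn.executemany(
        "INSERT INTO facts VALUES (?, ?, ?, ?, ?, ?)",
        [(user_id, f["fact_type"], timestamp, f["value"], int(f["explicit"]), session_id)
         for f in extracted],
    )
    conn.commit()


def history(conn, user_id, fact_type):
    """All recorded values for one fact type, oldest first."""
    rows = conn.execute(
        "SELECT timestamp, value FROM facts "
        "WHERE user_id = ? AND fact_type = ? ORDER BY timestamp",
        (user_id, fact_type),
    )
    return rows.fetchall()


store_facts(conn, "u1", "s1", "2025-01-05T10:00:00", [
    {"fact_type": "employer", "value": "Acme", "explicit": True},
    {"fact_type": "mood", "value": "tense", "explicit": False},
])
store_facts(conn, "u1", "s4", "2025-02-02T10:00:00", [
    {"fact_type": "employer", "value": "Globex", "explicit": True},
])
print(history(conn, "u1", "employer"))
```

With this key, two facts of the same type extracted from the same session would collide on insert, so a real schema would likely need a finer-grained fact type or an extra sequence column.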
Maintain a chronological event timeline per user. Each event entry contains: event_description, timestamp, source_session_id, causal_links (references to prior events that caused or influenced this one). When a new event is extracted, insert it into the timeline and update causal links by checking temporal proximity and semantic similarity to existing events.
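The timeline update can be sketched as follows. The `Event` fields mirror the entry described above; the word-overlap `jaccard` function is a toy stand-in for embedding similarity, and the 30-day window and 0.2 similarity cutoff are arbitrary illustrative defaults.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class Event:
    event_description: str
    timestamp: datetime
    source_session_id: str
    causal_links: list = field(default_factory=list)  # descriptions of earlier events


def jaccard(a: str, b: str) -> float:
    """Toy similarity: word overlap. A real system would compare embeddings."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


def insert_event(timeline, new, window=timedelta(days=30), min_sim=0.2):
    """Insert `new` in chronological order and link it to plausible earlier causes."""
    for prior in timeline:
        # Only events at or before the new one, and within the window, can be causes
        close_in_time = timedelta(0) <= new.timestamp - prior.timestamp <= window
        if close_in_time and jaccard(prior.event_description, new.event_description) >= min_sim:
            new.causal_links.append(prior.event_description)
    timeline.append(new)
    timeline.sort(key=lambda e: e.timestamp)


timeline = []
insert_event(timeline, Event("team restructuring announced at work", datetime(2025, 3, 1), "s9"))
insert_event(timeline, Event("adopted a cat", datetime(2025, 3, 5), "s9"))
insert_event(timeline, Event("work stress increased after team restructuring",
                             datetime(2025, 3, 10), "s10"))
print(timeline[-1].causal_links)
```

This version only links a new event back to earlier ones. If an event arrives out of order, later events already on the timeline are not re-linked to it.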
Before serving any fact from memory, run a conflict check: query the memory store for all facts of the same type (e.g., "employment_status") and compare timestamps. If a newer entry contradicts an older one, flag the older entry as superseded. Use a simple rule: most recent explicit statement wins. For implicit signals, require at least two corroborating sessions before updating a belief.
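The two rules above (most recent explicit statement wins; implicit signals need two corroborating sessions) can be sketched as a single pass over the facts of one type. The `Fact` record and the `current_belief` function are hypothetical names, and counting corroboration by distinct `session_id` is one reasonable reading of the rule.

```python
from dataclasses import dataclass


@dataclass
class Fact:
    fact_type: str
    value: str
    timestamp: str   # ISO 8601, so string comparison follows time order
    session_id: str
    explicit: bool


def current_belief(facts, fact_type, min_implicit_sessions=2):
    """Return (current_value, superseded_facts) for one fact type."""
    relevant = sorted((f for f in facts if f.fact_type == fact_type),
                      key=lambda f: f.timestamp)
    current = None
    for fact in relevant:
        if fact.explicit:
            current = fact  # most recent explicit statement wins
        else:
            # An implicit signal counts only once enough distinct sessions agree
            support = {f.session_id for f in relevant
                       if not f.explicit and f.value == fact.value
                       and f.timestamp <= fact.timestamp}
            if len(support) >= min_implicit_sessions:
                current = fact
    if current is None:
        return None, []
    superseded = [f for f in relevant
                  if f.value != current.value and f.timestamp < current.timestamp]
    return current.value, superseded


facts = [
    Fact("contact_method", "email", "2025-01-01", "ticket_238", True),
    Fact("contact_method", "Slack", "2025-02-01", "ticket_250", True),
    Fact("contact_method", "phone", "2025-02-10", "ticket_251", False),
]
value, old = current_belief(facts, "contact_method")
print(value, [f.value for f in old])
```

Here the single implicit "phone" signal is not enough to displace the explicit "Slack" statement; a second implicit "phone" signal from a different session would be.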
Add a confidence scoring layer between retrieval and generation. If the retrieved context does not contain evidence relevant to the query (measured by retrieval score threshold and semantic overlap with the question), instruct the model to respond with an acknowledgment of uncertainty rather than generating a speculative answer. Calibrate the threshold using held-out QA pairs.
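A minimal sketch of such a gate and its calibration, assuming the retriever exposes a similarity score for its top hit. The function names, the word-level `token_overlap` measure, and the accuracy-based threshold search are illustrative assumptions, not the paper's procedure.

```python
ABSTAIN_INSTRUCTION = (
    "The retrieved memory does not cover this question. "
    "Say that you have no record of it and ask the user to share more."
)


def token_overlap(question: str, context: str) -> float:
    """Fraction of the question's words that also appear in the retrieved context."""
    q = set(question.lower().split())
    c = set(context.lower().split())
    return len(q & c) / len(q) if q else 0.0


def should_answer(top_score: float, overlap: float,
                  score_min: float, overlap_min: float = 0.1) -> bool:
    """Answer only when both the retrieval score and the overlap clear their thresholds."""
    return top_score >= score_min and overlap >= overlap_min


def calibrate_threshold(held_out, candidates):
    """Pick the score threshold with the best accuracy on held-out pairs.

    held_out: list of (top_retrieval_score, answerable) tuples.
    """
    def accuracy(threshold):
        correct = sum((score >= threshold) == answerable for score, answerable in held_out)
        return correct / len(held_out)
    return max(candidates, key=accuracy)


held_out = [(0.82, True), (0.75, True), (0.40, False), (0.55, False), (0.61, True)]
score_min = calibrate_threshold(held_out, candidates=[0.3, 0.5, 0.6, 0.7])
print(score_min)
```

When `should_answer` returns `False`, the generation prompt would carry `ABSTAIN_INSTRUCTION` instead of the retrieved context. Accuracy treats a wrongful abstention and a hallucination as equally bad; weighting hallucinations more heavily is a natural variation.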
Maintain a living user model document that synthesizes extracted facts, temporal events, and inferred traits. Update it after each session using a structured prompt: given the current user model and the new session transcript, output an updated user model. Track what changed and why.
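The update step might look like the sketch below. The prompt wording, the JSON response contract (`user_model` plus `changes`), and both helper names are assumptions made for illustration; the call to the language model itself is left out.

```python
import json

UPDATE_PROMPT = """You maintain a long-term model of one user.
Update the model using only what the new session supports.

CURRENT USER MODEL:
{model}

NEW SESSION TRANSCRIPT:
{transcript}

Return JSON with two keys:
  "user_model": the full updated model
  "changes": a list of objects with keys "field", "old", "new", "reason"
"""


def build_update_prompt(user_model: dict, transcript: str) -> str:
    """Fill the template with the current model and the latest session."""
    return UPDATE_PROMPT.format(model=json.dumps(user_model, indent=2),
                                transcript=transcript)


def apply_update(llm_response: str, change_log: list) -> dict:
    """Parse the model's JSON reply, record what changed and why, return the new model."""
    parsed = json.loads(llm_response)
    change_log.extend(parsed["changes"])
    return parsed["user_model"]


prompt = build_update_prompt({"mood": "low"}, "User: I got the job!")
log = []
# A hand-written reply standing in for the model's output
reply = json.dumps({
    "user_model": {"mood": "upbeat"},
    "changes": [{"field": "mood", "old": "low", "new": "upbeat", "reason": "got the job"}],
})
updated = apply_update(reply, log)
print(updated, log)
```

Keeping the change log separate from the model makes it possible to audit why a belief shifted without replaying every session.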
For each capability, construct test queries from your dialogue history, following the per-capability test-case layout shown in Example 2.
Experiment with retrieval top-k (start with k=4 sessions), embedding model choice, and whether to prepend session summaries to the generation prompt. Measure across all five capabilities, not just QA accuracy — a configuration that improves extraction may degrade abstention.
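A configuration sweep that reports every capability separately could be sketched like this. The `evaluate` callback is a placeholder for your own benchmark run, and `fake_evaluate` exists only to make the sketch executable.

```python
from itertools import product

CAPABILITIES = [
    "information_extraction", "temporal_reasoning",
    "conflict_detection", "abstention", "user_modeling",
]


def sweep(evaluate, top_ks, embed_models, summary_options):
    """Run `evaluate` on every configuration and keep per-capability scores."""
    results = []
    for k, model, use_summary in product(top_ks, embed_models, summary_options):
        config = {"top_k": k, "embed_model": model, "prepend_summary": use_summary}
        scores = evaluate(config)  # expected: dict mapping capability -> score
        missing = set(CAPABILITIES) - scores.keys()
        if missing:
            raise ValueError(f"evaluate() skipped capabilities: {sorted(missing)}")
        results.append((config, scores))
    return results


def regressions(baseline, candidate, tolerance=0.0):
    """Capabilities where the candidate scores worse than the baseline."""
    return [c for c in CAPABILITIES if candidate[c] < baseline[c] - tolerance]


def fake_evaluate(config):
    # Placeholder scores so the sketch runs; replace with a real benchmark run
    base = 0.5 + 0.01 * config["top_k"]
    return {c: base for c in CAPABILITIES}


results = sweep(fake_evaluate, top_ks=[2, 4], embed_models=["bge-m3"],
                summary_options=[True, False])
baseline = {c: 0.5 for c in CAPABILITIES}
candidate = {**{c: 0.6 for c in CAPABILITIES}, "abstention": 0.4}
print(regressions(baseline, candidate))
```

The scores are compared per capability instead of averaged because the capabilities use different metrics, and an average would hide exactly the trade-off described above.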
Simulate 20+ sessions with evolving user states (job change, relationship change, mood shifts). Verify that the system correctly tracks the evolution, resolves conflicts, and does not hallucinate outdated information.
Example 1: Building a Memory-Augmented Therapy Support Bot
User: "I'm building a mental health support chatbot. Users come back weekly. I need the bot to remember what they said in previous sessions without hallucinating."
Approach:
User model schema fields: name, age, presenting_concerns[], coping_strategies[], support_network[], mood_trajectory[], life_events[]
Post-session extraction prompt:
extraction_prompt = """Given this therapy session transcript, extract:
- Explicit facts (names, events, dates mentioned)
- Emotional state indicators (mood words, tone shifts)
- Life event updates (new events, changes to known situations)
- Coping strategies mentioned
Output as JSON with confidence scores (0-1) for each item."""
Abstention instruction for the system prompt:
If the user asks about something you have no record of them mentioning,
say "I don't recall you mentioning that. Could you tell me more?"
Never guess or fabricate details about the user's life.
Output — memory-augmented system prompt for session 12:
You are a supportive counselor. This is session 12 with Alex.
USER MODEL (updated after session 11):
- Presenting concern: work stress, recently escalated due to team restructuring (session 9)
- Previously mentioned breakup in session 3, reported feeling better by session 7
- Coping: journaling (started session 5), running (mentioned session 8, stopped session 10 due to knee injury)
- Current mood trajectory: improving from session 9 low point
RELEVANT PRIOR SESSIONS: [retrieved session 9, 10, 11, and 7 transcripts]
RULES: Reference prior sessions naturally. If uncertain about a detail, ask rather than assume. Track any new life events for the timeline.
Example 2: Evaluating an Existing Chatbot's Memory Capabilities
User: "I have a customer support bot with RAG. How do I test whether it actually remembers users correctly across tickets?"
Approach:
test_cases = {
    "information_extraction": [
        {"query": "What product did customer #42 purchase last month?",
         "ground_truth": "Pro Plan annual subscription",
         "source_sessions": ["ticket_238", "ticket_241"]},
    ],
    "temporal_reasoning": [
        {"query": "Did customer #42 report the billing issue before or after upgrading?",
         "ground_truth": "after",
         "requires_sessions": ["ticket_241", "ticket_245"]},
    ],
    "conflict_detection": [
        {"query": "What is customer #42's preferred contact method?",
         "ground_truth": "Slack (updated from email in ticket_250)",
         "conflict_sessions": ["ticket_238", "ticket_250"]},
    ],
    "abstention": [
        {"query": "What is customer #42's company size?",
         "ground_truth": "NEVER_DISCLOSED",
         "expected_behavior": "decline_to_answer"},
    ],
    "user_modeling": [
        {"query": "Describe customer #42's satisfaction trajectory.",
         "ground_truth": "Initially satisfied, frustrated after billing error, recovering after resolution",
         "requires_all_sessions": True},
    ],
}
Output — evaluation report:
MEMORY CAPABILITY REPORT — CustomerBot v2.3
============================================
Information Extraction: F1=0.72 (Good — retrieves most explicit facts)
Temporal Reasoning: Acc=0.41 (Poor — often confuses event ordering)
Conflict Detection: F1=0.38 (Poor — serves outdated preferences 62% of the time)
Abstention: Rate=0.55 (Moderate — hallucinates on 45% of unknown queries)
User Modeling: LLM=1.1/2.0 (Fair — captures broad patterns, misses evolution)
TOP PRIORITY: Implement conflict detection via timestamped fact store.
SECOND: Add abstention gate with retrieval confidence threshold.
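The F1 and abstention-rate figures in a report like this need concrete scoring functions. The sketch below assumes a token-overlap F1 of the kind commonly used for extractive QA and a keyword check for abstention; both are illustrative choices and may differ from the metrics used in the ES-MemEval paper.

```python
from collections import Counter


def token_f1(prediction: str, ground_truth: str) -> float:
    """Harmonic mean of token precision and recall, case-insensitive."""
    pred, gold = prediction.lower().split(), ground_truth.lower().split()
    overlap = sum((Counter(pred) & Counter(gold)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


# Phrases treated as a refusal; an assumption to tune for your bot's wording
ABSTAIN_MARKERS = ("i don't know", "i don't recall", "no record", "not sure")


def abstention_rate(responses) -> float:
    """Share of responses to unanswerable queries that decline to answer."""
    declined = sum(any(m in r.lower() for m in ABSTAIN_MARKERS) for r in responses)
    return declined / len(responses) if responses else 0.0


f1 = token_f1("Pro Plan annual", "Pro Plan annual subscription")
rate = abstention_rate([
    "I don't recall you mentioning that.",
    "They have about 500 employees.",
])
print(round(f1, 3), rate)
```

Keyword matching will miss politely worded refusals, so an LLM judge or a labeled sample is a sensible cross-check before trusting the abstention rate.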
Example 3: Choosing Retrieval Granularity for a Coaching App
User: "Should I chunk my coaching session transcripts by individual messages, exchanges, or whole sessions for RAG?"
Approach:
# Document, FAISSIndex, generate_summary, and get_full_session are placeholders
# for your own vector-store wrapper and helpers, not a specific library's API.

# Primary index: full session transcripts
session_docs = [
    Document(
        text=s.full_transcript,
        metadata={"session_id": s.id, "timestamp": s.date, "user_id": s.user_id},
    )
    for s in sessions
]
session_index = FAISSIndex(embed_model="bge-m3", documents=session_docs)

# Secondary index: session summaries for fast scanning
summary_docs = [
    Document(
        text=generate_summary(s),
        # user_id is needed here because retrieve() filters on it
        metadata={"session_id": s.id, "timestamp": s.date, "user_id": s.user_id},
    )
    for s in sessions
]
summary_index = FAISSIndex(embed_model="bge-m3", documents=summary_docs)

# Retrieval: use summaries to identify candidate sessions, then fetch full text
def retrieve(query, user_id, k=4):
    # Over-fetch 2k candidates, then keep the first k in ranked order
    candidates = summary_index.query(query, filter={"user_id": user_id}, top_k=k * 2)
    session_ids = [c.metadata["session_id"] for c in candidates]
    return [get_full_session(sid) for sid in session_ids[:k]]
Chen, T., Lu, J., Shen, Y., & Zhang, L. (2026). ES-MemEval: Benchmarking Conversational Agents on Personalized Long-Term Emotional Support. The Web Conference (WWW) 2026. arXiv:2602.01885 — Focus on Section 3 (benchmark design and five-capability taxonomy), Section 4 (EvoEmo dataset construction pipeline), and Section 5.3 (RAG retrieval granularity experiments showing session-level superiority).