skills/evermembench-benchmarking-long-term-interactive/SKILL.md
Build and evaluate long-term conversational memory systems for multi-party, multi-topic dialogues. Implements the EverMemBench framework for stress-testing memory architectures against realistic workplace conversation patterns with temporal evolution, cross-topic interleaving, and role-specific personas. Use when: 'build a memory system for multi-user chat', 'evaluate my RAG memory pipeline', 'benchmark long-term conversation recall', 'test memory across multi-party dialogues', 'design a temporal memory store for chat agents', 'audit retrieval quality for conversational AI'.
npx skillsauth add ndpvt-web/arxiv-claude-skills evermembench-benchmarking-long-term-interactive
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
This skill enables Claude to design, build, and evaluate long-term conversational memory systems that handle the hard cases real-world chat applications face: multiple speakers, evolving facts, interleaved topics, and implicit context that similarity search misses. It applies the EverMemBench framework's three-dimensional evaluation (fine-grained recall, memory awareness, user profile understanding) to diagnose exactly where a memory architecture fails and guide targeted fixes.
EverMemBench identifies three orthogonal failure modes in conversational memory that existing benchmarks miss. Fine-grained recall tests whether the system can retrieve specific facts, with subcategories for single-hop (direct entity lookup), multi-hop (reasoning across multiple conversation groups or speakers), and temporal (tracking version history of evolving facts). Memory awareness tests whether the system knows what it knows: can it apply implicit constraints from past conversations, proactively surface relevant context for new proposals, and detect when earlier decisions have been superseded? Profile understanding tests whether the system can infer a user's communication style, professional skills, and organizational role from their conversational behavior across interactions.
The critical finding is a hierarchy of difficulty that exposes where architectures break. Single-hop recall is essentially solved (97-99% with oracle evidence), but multi-hop collapses to 26% even with perfect retrieval because reasoning must bridge information scattered across speakers and groups. Temporal reasoning caps at 60% because it requires version semantics -- understanding that "the Q3 budget" mentioned on Wednesday supersedes the one from Monday -- not just timestamp matching. Memory awareness tasks show a 20-70 point gap between full-context and retrieval-augmented systems, which points to retrieval fidelity, not reasoning capability, as the bottleneck for those tasks. Strong models (Gemini-3-Flash) degrade from 72% to 52-62% when memory augmentation is added, because retrieval introduces artifacts and version conflicts that confuse capable reasoners.
The actionable insight: memory architectures must move beyond flat vector stores toward versioned, graph-structured representations that preserve provenance (who said what, when, in which context) and support explicit supersession relationships between facts. Retrieval must bridge the semantic gap between a user's query ("what's the current budget?") and implicitly relevant memories (a side conversation where a constraint change invalidated the previously agreed number).
Characterize the conversation corpus. Count the number of distinct speakers, conversation groups/channels, total token volume, and time span. Identify whether information evolves over time (facts get updated) and whether topics interleave across groups. This determines which EverMemBench dimensions are relevant.
Define the memory schema with provenance fields. Every memory entry must store: content, speaker identity, group/channel, timestamp, topic tags, and a supersession pointer (which earlier memory this updates, if any). Do not flatten conversations into anonymous text chunks -- the metadata is what makes multi-hop and temporal reasoning possible.
Implement tiered memory indexing. Build three retrieval paths: (a) semantic similarity for single-hop recall, (b) graph traversal for multi-hop queries that require connecting facts across speakers/groups, and (c) temporal version chains for queries about the current state of evolving facts. A single embedding-based retriever will fail on (b) and (c).
Generate diagnostic QA pairs across all three dimensions. For fine-grained recall: create single-hop (direct fact lookup), multi-hop (requires joining facts from 2+ conversation fragments), and temporal (requires identifying the latest version of a changed fact) questions. For memory awareness: create constraint-application, proactive-surfacing, and update-detection scenarios. For profile understanding: create style-matching, skill-inference, and role-identification queries.
Run a blind-test filter on generated QA pairs. Present each question to the LLM without any conversation context. If it answers correctly from world knowledge alone, the question is not testing memory -- discard it. This prevents inflated scores from parametric knowledge leakage.
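The blind-test filter above can be sketched in a few lines. This is a minimal illustration, not the benchmark's own code: `answer_without_context` and `is_correct` are hypothetical callables you would back with your LLM client and your grading function, and the `$40k` answer is invented sample data.

```python
from typing import Callable


def blind_test_filter(
    qa_pairs: list[dict],
    answer_without_context: Callable[[str], str],
    is_correct: Callable[[str, str], bool],
) -> list[dict]:
    """Keep only QA pairs the model cannot answer without conversation context."""
    kept = []
    for qa in qa_pairs:
        blind_answer = answer_without_context(qa["question"])
        # A correct blind answer means the question leaks parametric knowledge.
        if not is_correct(blind_answer, qa["answer"]):
            kept.append(qa)
    return kept


def stub_model(question: str) -> str:
    # Hypothetical stand-in for an LLM call made with no conversation context.
    return "Paris" if "capital of France" in question else "I don't know"


def exact_match(predicted: str, expected: str) -> bool:
    return predicted.strip().lower() == expected.strip().lower()


pairs = [
    {"question": "What is the capital of France?", "answer": "Paris"},
    {"question": "What budget did Sarah propose for Q3?", "answer": "$40k"},
]
filtered = blind_test_filter(pairs, stub_model, exact_match)
```

Only the second pair survives here, because the stub answers the first one from world knowledge alone. In practice exact match is usually too strict for free-text answers, so a more lenient grader would replace `exact_match`.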
Evaluate with evidence grounding. For each QA pair, verify: (a) the answer is derivable from the ground-truth evidence segment (sufficiency), and (b) the answer is NOT derivable from alternative segments (uniqueness). This ensures questions test retrieval precision, not lucky guessing from noisy context.
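One way to express the sufficiency and uniqueness checks is shown below. It is a sketch under stated assumptions: `answer_from` is a hypothetical callable that tries to answer a question from a single segment (an LLM call in a real setup, a regex stub here), and the segment texts are made up for illustration.

```python
import re
from typing import Callable


def check_grounding(
    qa: dict,
    segments: dict[str, str],
    answer_from: Callable[[str, str], str],
    is_correct: Callable[[str, str], bool],
) -> dict:
    """Check that the gold segment suffices and that no other segment does."""
    gold_id = qa["evidence_id"]
    sufficient = is_correct(answer_from(qa["question"], segments[gold_id]), qa["answer"])
    # Any non-gold segment that also yields the answer breaks uniqueness.
    leaking = [
        seg_id
        for seg_id, text in segments.items()
        if seg_id != gold_id and is_correct(answer_from(qa["question"], text), qa["answer"])
    ]
    return {"sufficient": sufficient, "unique": not leaking, "leaking_segments": leaking}


def stub_answer_from(question: str, segment: str) -> str:
    # Hypothetical extractor: pulls a "March <day>" date out of the segment.
    match = re.search(r"March \d+", segment)
    return match.group(0) if match else ""


segments = {
    "s1": "Deadline moved to March 15.",
    "s2": "Lunch is at noon.",
    "s3": "Reminder: the deadline is March 15.",
}
qa = {"question": "What is the current deadline?", "answer": "March 15", "evidence_id": "s1"}
result = check_grounding(qa, segments, stub_answer_from, lambda a, b: a == b)
```

In this example the pair is sufficient but not unique, since `s3` also contains the answer, so it would be revised or discarded.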
Benchmark retrieval isolation. Run the full pipeline, then run the same questions with oracle (ground-truth) evidence substituted for retrieved evidence. The gap between oracle and retrieved performance quantifies the retrieval bottleneck. If oracle performance is also low (as with multi-hop at 26%), the problem is reasoning, not retrieval.
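The oracle-versus-retrieved comparison reduces to a per-dimension subtraction plus a rule for reading the result. The sketch below assumes accuracy is reported in whole percentage points. The `low_oracle` and `large_gap` thresholds are arbitrary illustrative choices, not values from the paper, and the scores are sample numbers.

```python
def isolate_bottleneck(
    retrieved: dict[str, int],
    oracle: dict[str, int],
    low_oracle: int = 50,   # assumed cutoff: below this, oracle itself is weak
    large_gap: int = 20,    # assumed cutoff: gap (in points) worth flagging
) -> dict[str, dict]:
    """Label each dimension's bottleneck from retrieved vs. oracle accuracy."""
    report = {}
    for dimension, oracle_score in oracle.items():
        gap = oracle_score - retrieved[dimension]
        if oracle_score < low_oracle:
            # Even perfect evidence does not help: the model's reasoning is the limit.
            bottleneck = "reasoning"
        elif gap >= large_gap:
            bottleneck = "retrieval"
        else:
            bottleneck = "none flagged"
        report[dimension] = {"gap": gap, "bottleneck": bottleneck}
    return report


scores_retrieved = {"single_hop": 82, "multi_hop": 8, "constraint": 45}
scores_oracle = {"single_hop": 98, "multi_hop": 26, "constraint": 96}
report = isolate_bottleneck(scores_retrieved, scores_oracle)
```

With these sample scores, multi-hop is labeled a reasoning problem (oracle is only 26), constraint awareness a retrieval problem (51-point gap), and single-hop is not flagged.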
Test strong-model degradation. Run the same queries with your strongest available LLM in both full-context mode and memory-augmented mode. If memory augmentation degrades performance, your retrieval is introducing artifacts or version conflicts. This is a common failure mode in production systems.
Iterate on the weakest dimension. Use the per-dimension scores to prioritize fixes: temporal failures need version chains, multi-hop failures need graph structure, memory awareness failures need better retrieval bridging, profile failures need aggregation across distributed interactions.
Validate with adversarial perturbation. For memory awareness questions, introduce plausible distractors that share surface-level similarity with the correct answer but violate an implicit constraint from an earlier conversation. Systems that rely purely on similarity retrieval will consistently select the distractor.
Example 1: Evaluating a Slack-bot Memory System
User: "I built a memory layer for our Slack bot that stores conversation summaries in a vector DB. Users are complaining it gives outdated answers about project status. How do I diagnose this?"
Approach:
Output:
Diagnosis Report:
- Dimension: Fine-grained Recall > Temporal
- Root cause: Vector similarity retrieves the most semantically detailed
  status update (the original project plan), not the most recent one
  (a brief "deadline moved to March 15" message).
- Fix: Add a `supersedes` field to memory entries. At retrieval time,
  follow supersession chains to return only the terminal (latest) node.
- Validation: Create 20 temporal QA pairs where facts changed 2-4 times.
  Measure % of queries returning the latest version.
  Target: >90% (current baseline likely <40%).
Example 2: Building a Multi-Party Meeting Memory System
User: "I need to build a memory system for an AI assistant that sits in team meetings with 8-12 people. It should remember who said what and answer questions about decisions made across multiple meetings."
Approach:
Output:
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime


# Memory entry schema
@dataclass
class MeetingMemory:
    content: str               # What was said
    speaker: str               # Who said it
    meeting_id: str            # Which meeting
    timestamp: datetime        # When
    topic_tags: list[str]      # Subject matter
    supersedes: str | None     # ID of earlier memory this updates
    references: list[str]      # IDs of memories this responds to
    decision_type: str | None  # "proposal" | "agreement" | "rejection" | "revision"


# Retrieval strategy per query type (pseudocode strings, not executable calls)
RETRIEVAL_STRATEGY = {
    "single_hop": "semantic_search(query, top_k=10)",
    "multi_hop": "graph_traverse(seed=semantic_search(query, top_k=5), hops=2, expand_by=['references', 'speaker'])",
    "temporal": "version_chain(semantic_search(query, top_k=10), prefer='latest')",
    "constraint": "semantic_search(query, top_k=10) + graph_traverse(seed, filter='decision_type=rejection|revision')",
    "profile": "aggregate_by_speaker(speaker_id, facets=['style', 'skill', 'role'])",
}

# Diagnostic QA generation template
diagnostic_pairs = [
    # Single-hop: "What budget did Sarah propose for Q3?"
    # Multi-hop: "Did engineering approve the timeline that marketing proposed in the Monday standup?"
    # Temporal: "What is the current agreed-upon launch date?" (changed 3 times across meetings)
    # Constraint: "Can we schedule a demo on Friday?" (someone said no client demos on Fridays, 2 weeks ago)
    # Profile/Style: "Draft a message to Tom in his usual communication style."
]
Example 3: Auditing Retrieval Quality for an Existing Memory Pipeline
User: "Our conversational AI uses Mem0 for memory. It scores well on simple recall but fails on complex questions. How do I figure out what's breaking?"
Approach:
Output:
Audit Results (100 QA pairs across 6 subcategories):

Dimension             | Mem0-augmented | Oracle evidence | Gap
----------------------|----------------|-----------------|-----
Single-hop recall     | 82%            | 98%             | 16%
Multi-hop recall      | 8%             | 26%             | 18%
Temporal recall       | 15%            | 54%             | 39%  <-- worst
Constraint awareness  | 45%            | 96%             | 51%  <-- worst
Proactive awareness   | 38%            | 100%            | 62%  <-- worst
Profile understanding | 29%            | 61%             | 32%

Key findings:
1. Temporal gap (39%) indicates Mem0 retrieves stale versions. Need version chains.
2. Memory awareness gap (51-62%) indicates similarity search misses implicitly
   relevant memories. Need constraint-aware retrieval or graph expansion.
3. Multi-hop oracle is only 26%, meaning even with perfect retrieval,
   the base model struggles. Consider chain-of-thought prompting with
   explicit evidence citation to improve multi-hop reasoning.
4. Strong-model degradation detected: base Gemini scores 72% full-context
   but only 58% with Mem0, confirming retrieval artifacts harm capable models.
Retrieval returns stale information: When temporal queries consistently return outdated facts, check whether your memory store has supersession pointers. If not, implement version chains and modify retrieval to follow chains to the terminal node. As a stopgap, boost recency weighting in your similarity score.
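Following supersession chains to the terminal node can be sketched as below. The memory dicts, their `id` and `supersedes` keys, and the sample contents are assumptions for illustration. The sketch also assumes each memory is superseded by at most one later memory, so chains are linear.

```python
def build_successor_map(memories: list[dict]) -> dict[str, str]:
    """Map each superseded memory ID to the ID of the memory that replaces it."""
    return {m["supersedes"]: m["id"] for m in memories if m.get("supersedes")}


def resolve_latest(memory_id: str, successor: dict[str, str]) -> str:
    """Walk the chain forward until reaching a memory nothing supersedes."""
    seen = set()
    # The `seen` guard stops the walk if malformed data contains a cycle.
    while memory_id in successor and memory_id not in seen:
        seen.add(memory_id)
        memory_id = successor[memory_id]
    return memory_id


def latest_only(retrieved_ids: list[str], memories: list[dict]) -> list[str]:
    """Replace every retrieved ID by its terminal version, dropping duplicates."""
    successor = build_successor_map(memories)
    result: list[str] = []
    for memory_id in retrieved_ids:
        terminal = resolve_latest(memory_id, successor)
        if terminal not in result:
            result.append(terminal)
    return result


memories = [
    {"id": "m1", "content": "Original project plan with full schedule", "supersedes": None},
    {"id": "m2", "content": "Deadline moved to March 1", "supersedes": "m1"},
    {"id": "m3", "content": "Deadline moved to March 15", "supersedes": "m2"},
    {"id": "m4", "content": "Team lunch on Thursday", "supersedes": None},
]
current = latest_only(["m1", "m4", "m3"], memories)
```

Here the stale hit `m1` resolves to `m3`, so the result holds only `m3` and the unrelated `m4`. Whether to return the terminal node alone or the whole chain depends on whether the question asks for the current state or the history.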
Multi-hop accuracy near zero: If multi-hop scores are below 10%, the retrieval is likely returning fragments from only one side of the reasoning chain. Implement two-stage retrieval: first retrieve seed documents, then expand the retrieval set by following reference and speaker links to related memories before passing to the LLM.
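The two-stage retrieval described above amounts to a bounded breadth-first expansion from the seed results. The following is a rough sketch: the memory dicts, the `references` and `speaker` keys, and the sample contents are illustrative assumptions. Linking every memory from the same speaker can expand the set aggressively on a real corpus, so a cap or a score-based filter would likely be needed.

```python
from collections import deque


def build_links(memories: list[dict]) -> dict[str, set[str]]:
    """Undirected links between memories that reference each other or share a speaker."""
    links: dict[str, set[str]] = {m["id"]: set() for m in memories}
    by_speaker: dict[str, list[str]] = {}
    for m in memories:
        for ref in m.get("references", []):
            if ref in links:
                links[m["id"]].add(ref)
                links[ref].add(m["id"])
        by_speaker.setdefault(m["speaker"], []).append(m["id"])
    for ids in by_speaker.values():
        for memory_id in ids:
            links[memory_id].update(other for other in ids if other != memory_id)
    return links


def expand_seeds(seed_ids: list[str], links: dict[str, set[str]], hops: int = 2) -> list[str]:
    """Stage two: breadth-first expansion of the seed set, at most `hops` links away."""
    ordered = list(dict.fromkeys(seed_ids))  # de-duplicate, keep seed order
    seen = set(ordered)
    frontier = deque((memory_id, 0) for memory_id in ordered)
    while frontier:
        node, depth = frontier.popleft()
        if depth == hops:
            continue
        for neighbor in sorted(links.get(node, ())):
            if neighbor not in seen:
                seen.add(neighbor)
                ordered.append(neighbor)
                frontier.append((neighbor, depth + 1))
    return ordered


memories = [
    {"id": "m1", "speaker": "marketing_dana", "content": "Proposed launch timeline", "references": []},
    {"id": "m2", "speaker": "eng_lee", "content": "Engineering approves that timeline", "references": ["m1"]},
    {"id": "m3", "speaker": "ops_sam", "content": "Office closed on Friday", "references": []},
]
expanded = expand_seeds(["m1"], build_links(memories), hops=1)
```

A seed that matches only the marketing proposal now also pulls in the engineering reply, so both sides of the reasoning chain reach the LLM, while the unrelated `m3` stays out.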
Strong model degrades with memory augmentation: This means retrieved context contains contradictory or outdated information that confuses the model. Filter retrieved memories for version conflicts before injecting them as context. When two memories contradict each other, keep only the more recent one or explicitly mark the conflict for the model.
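For the "explicitly mark the conflict" option, one possible shape is to annotate superseded entries when rendering the context. This sketch assumes memory dicts with `id`, `speaker`, `timestamp` (ISO-8601 strings, so they sort chronologically), `content`, and `supersedes`. The budget figures are invented sample data.

```python
def annotate_conflicts(retrieved: list[dict]) -> str:
    """Render retrieved memories oldest-first, flagging ones a later entry supersedes."""
    # Only supersession pairs where both entries were retrieved get flagged.
    superseded_by = {m["supersedes"]: m["id"] for m in retrieved if m.get("supersedes")}
    lines = []
    for m in sorted(retrieved, key=lambda entry: entry["timestamp"]):
        line = f"[{m['id']} | {m['timestamp']} | {m['speaker']}] {m['content']}"
        if m["id"] in superseded_by:
            line += f" (SUPERSEDED by {superseded_by[m['id']]})"
        lines.append(line)
    return "\n".join(lines)


retrieved = [
    {"id": "m2", "speaker": "sarah", "timestamp": "2025-01-08T10:00",
     "content": "Q3 budget revised to 35k.", "supersedes": "m1"},
    {"id": "m1", "speaker": "sarah", "timestamp": "2025-01-06T09:00",
     "content": "Q3 budget is 40k.", "supersedes": None},
]
context = annotate_conflicts(retrieved)
```

Marking rather than dropping the older entry keeps the history available for questions about how a decision changed, at the cost of a longer prompt.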
Memory awareness scores far below oracle: The semantic gap between queries and implicitly relevant memories is too wide for similarity search. Consider augmenting queries with LLM-generated hypothetical relevant memories (HyDE-style) or maintaining explicit constraint indexes that map action types to relevant restrictions.
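An explicit constraint index can be as simple as a mapping from action types to memory IDs, merged into the similarity results. The sketch below is illustrative only: `ConstraintIndex`, `retrieve_with_constraints`, the action-type labels, and the memory IDs are all hypothetical. It also assumes some upstream step (for example an LLM classifier) has already assigned an action type to both the stored constraint and the incoming query.

```python
from collections import defaultdict
from typing import Callable


class ConstraintIndex:
    """Maps an action type (e.g. "schedule_demo") to memories that restrict it."""

    def __init__(self) -> None:
        self._by_action: dict[str, list[str]] = defaultdict(list)

    def add(self, action_type: str, memory_id: str) -> None:
        if memory_id not in self._by_action[action_type]:
            self._by_action[action_type].append(memory_id)

    def lookup(self, action_type: str) -> list[str]:
        return list(self._by_action.get(action_type, []))


def retrieve_with_constraints(
    query: str,
    action_type: str,
    semantic_search: Callable[[str], list[str]],
    index: ConstraintIndex,
) -> list[str]:
    """Union of similarity hits and indexed constraints, similarity hits first."""
    merged = list(semantic_search(query))
    for memory_id in index.lookup(action_type):
        if memory_id not in merged:
            merged.append(memory_id)
    return merged


index = ConstraintIndex()
index.add("schedule_demo", "m_no_client_demos_on_fridays")


def stub_search(query: str) -> list[str]:
    # Hypothetical similarity search that misses the Friday constraint.
    return ["m_team_calendar", "m_demo_script"]


hits = retrieve_with_constraints(
    "Can we schedule a demo on Friday?", "schedule_demo", stub_search, index
)
```

The constraint memory reaches the model even though similarity search never surfaced it. The cost is that someone, or some extraction step, has to populate the index when constraints are first stated.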
EverMemBench: Benchmarking Long-Term Interactive Memory in Large Language Models -- Hu et al., 2026. Look for: the three-dimensional evaluation taxonomy (Section 3), the retrieval isolation methodology comparing oracle vs. augmented performance (Section 5), and the six key findings on where current memory architectures fail (Section 5.1-5.6).