skills/latent-briefing/SKILL.md
This skill should be used when the user asks to "share memory between agents", "KV cache compaction for multi-agent", "orchestrator worker context", "latent briefing", "reduce worker tokens", "cross-agent memory without summarization", or discusses Attention Matching compaction, recursive language models with workers, or token explosion in hierarchical agents.
Hierarchical multi-agent systems often pay for the same context twice. The orchestrator accumulates a long reasoning trajectory, but each worker usually receives only a narrow text handoff such as a subtask prompt plus raw document slices. Passing the full trajectory fixes coverage but drives token cost up on every worker call. Summarization introduces latency and information loss. Retrieval helps with document access but does not preserve the orchestrator's evolving reasoning state.
Latent Briefing addresses this by sharing memory at the representation level rather than the text level. The core idea is to compact the orchestrator trajectory in the worker model's KV cache, keeping positions that are most relevant to the current worker task. The method builds on Attention Matching (AM) KV cache compaction and adapts it for inference-time multi-agent handoff with task-guided queries, a shared token mask across heads, and robust thresholding.
Activate this skill when:
Do not activate this skill for adjacent work owned by other skills:
- context-compression
- memory-systems
- multi-agent-patterns
- context-optimization

The token explosion pattern. In recursive or REPL-style systems, the orchestrator repeatedly calls a worker to inspect evidence, verify hypotheses, or answer subquestions. The orchestrator's trajectory grows with partial conclusions, dead ends, tool output, and prior worker responses. If that trajectory is passed in full on every worker call, cost compounds quickly.
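To make the compounding concrete, here is a back-of-the-envelope sketch. The fixed growth per call, the call count, and the `keep_ratio` parameter are illustrative assumptions, not figures from the public write-up.

```python
def worker_input_tokens(calls: int, growth_per_call: int, keep_ratio: float = 1.0) -> int:
    """Total trajectory tokens handed to the worker across `calls` calls.

    Assumption: the trajectory grows by a fixed `growth_per_call` tokens
    before each call, and a fraction `keep_ratio` of it reaches the worker.
    """
    total = 0
    for call in range(1, calls + 1):
        trajectory_len = call * growth_per_call
        total += round(trajectory_len * keep_ratio)
    return total


full = worker_input_tokens(calls=10, growth_per_call=2_000)
compacted = worker_input_tokens(calls=10, growth_per_call=2_000, keep_ratio=0.25)
print(full, compacted)  # 110000 27500
```

Under these assumptions the full-trajectory total grows quadratically with the number of worker calls, which is why the handoff cost can dominate in long recursive runs.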
Representation-level sharing. Instead of summarizing the trajectory into natural language, the system operates on the worker model's KV cache. It retains the positions that the worker would attend to for the current task and drops the rest. This is more specific than ordinary prefix caching: prefix caching reuses identical prefixes, while Latent Briefing also performs task-conditioned selective retention inside the reused trajectory.
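The contrast with prefix caching can be sketched with token lists. The helper name `reusable_prefix_len` and the toy tokens are hypothetical; real runtimes match prefixes on token IDs or cache blocks.

```python
def reusable_prefix_len(cached, new):
    """Number of leading tokens that are identical in both sequences."""
    n = 0
    for a, b in zip(cached, new):
        if a != b:
            break
        n += 1
    return n


t1 = ["plan", "read", "sec3", "note"]
t2 = t1 + ["replyA", "think", "taskB"]              # trajectory extended in place
edited = ["plan", "READ", "sec3", "note", "taskB"]  # one early token changed

print(reusable_prefix_len(t1, t2))      # 4: the whole of t1 is reusable
print(reusable_prefix_len(t1, edited))  # 1: reuse stops at the first difference
```

Prefix caching only answers "how much of this prefix is identical"; it says nothing about which of those reused positions matter for the current task. Task-conditioned retention is the additional step that Latent Briefing layers on top.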
Attention Matching as the compaction engine. AM seeks a smaller cache whose attention outputs approximate the full cache. Latent Briefing adapts AM for multi-agent inference by changing the scoring signal and batching strategy:
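- Task-guided queries: score trajectory positions with queries from the current worker task prompt rather than generic samples from the context.
- A shared token mask across heads: aggregate per-head scores into one mask so compaction can be batched.
- Robust thresholding: keep positions above median + tau * MAD rather than a fixed top-k per head.

Reference result shape. The public write-up reports substantial worker-token reduction, material total-token savings, and low-single-digit-second compaction overhead on long-document QA workloads (claim-latent-briefing-public-results). Treat these numbers as workload-specific evidence, not a general guarantee.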
| Approach | Primary weakness |
|----------|------------------|
| LLM summarization | High latency, lossy abstraction, and no guarantee the summary preserves what the next subtask needs |
| Retrieval / RAG | Depends on chunking and embeddings; can miss cross-chunk or cross-step dependencies |
| Pass full trajectory | Cost scales with every worker call and irrelevant context can degrade worker quality |
Latent Briefing is useful when the bottleneck is not document retrieval itself, but how to transfer orchestrator state into a worker efficiently and precisely.
Frameworks such as Recursive Language Models treat long context as an environment and recurse over it: an orchestrator decomposes work and delegates to workers. Latent Briefing fits the gap where the orchestrator has already built task-specific state that should inform the worker, but re-serializing that state as text is too expensive or noisy.
In the ideal setup, the worker maintains a persistent KV state for the orchestrator trajectory. New trajectory tokens extend that state, then compaction runs just before generation for the current subtask.
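The extend-then-compact order described above can be sketched with a toy state object. `TrajectoryState` is a hypothetical stand-in that holds strings instead of KV tensors, and it assumes the full state is kept while each compaction yields a per-task view.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class TrajectoryState:
    """Toy stand-in for a persistent worker-side KV state (one entry per position)."""
    entries: List[str] = field(default_factory=list)

    def extend(self, new_entries: List[str]) -> None:
        # New trajectory tokens are appended; earlier positions are not recomputed.
        self.entries.extend(new_entries)

    def compact_for(self, keep: Callable[[str], bool]) -> List[str]:
        # Returns a task-specific view; the persistent state is left intact.
        return [e for e in self.entries if keep(e)]


state = TrajectoryState()
state.extend(["plan", "revenue note", "dead end"])
state.extend(["reply A", "revenue vs guidance"])

view = state.compact_for(lambda e: "revenue" in e)  # placeholder relevance test
print(view)                # ['revenue note', 'revenue vs guidance']
print(len(state.entries))  # 5
```

Keeping the full state and compacting per call is a design assumption here: it lets the next subtask select a different slice of the same trajectory.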
Task-guided query vectors. Use queries from the current worker task prompt, not generic samples from the context. Forward-pass the trajectory plus current task through the worker model, then score trajectory positions by how strongly the task attends to them.
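A minimal single-head sketch of this scoring step follows. The function name and toy vectors are hypothetical, and the softmax is restricted to trajectory keys for simplicity; in a real forward pass the task tokens also attend to each other.

```python
import math


def softmax(xs):
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def task_attention_scores(task_queries, trajectory_keys):
    """Mean attention weight each trajectory position receives from the task queries."""
    dim = len(trajectory_keys[0])
    scores = [0.0] * len(trajectory_keys)
    for q in task_queries:
        logits = [
            sum(qi * ki for qi, ki in zip(q, key)) / math.sqrt(dim)
            for key in trajectory_keys
        ]
        for pos, weight in enumerate(softmax(logits)):
            scores[pos] += weight
    return [s / len(task_queries) for s in scores]


keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three trajectory positions
queries = [[4.0, 0.0]]                        # one task-prompt token
scores = task_attention_scores(queries, keys)
print([round(s, 3) for s in scores])
```

Positions 0 and 2 align with the task query and receive most of the attention mass, while position 1 scores low and becomes a candidate for dropping.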
Shared token selection. Aggregate scores across layers and heads into one per-position score. One shared mask enables batched operations and avoids hundreds of incompatible per-head solves.
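One way to picture the shared mask is sketched below. Averaging over layers and heads is an assumption (the text only says scores are aggregated), the `0.3` cut-off is a placeholder, and both function names are hypothetical.

```python
def aggregate_scores(per_head_scores):
    """per_head_scores[layer][head][position] -> one mean score per position."""
    heads = [head for layer in per_head_scores for head in layer]
    n_positions = len(heads[0])
    return [sum(head[p] for head in heads) / len(heads) for p in range(n_positions)]


def apply_shared_mask(per_head_values, keep):
    """Drop the same positions from every head, so all heads keep equal length."""
    return [[v for v, k in zip(head, keep) if k] for head in per_head_values]


per_head = [
    [[0.9, 0.1, 0.0], [0.1, 0.1, 0.8]],  # layer 0: two heads, three positions
    [[0.5, 0.0, 0.5], [0.5, 0.2, 0.3]],  # layer 1
]
position_scores = aggregate_scores(per_head)
keep = [s >= 0.3 for s in position_scores]  # placeholder cut-off for illustration

layer0_cache = [["k0", "k1", "k2"], ["k0", "k1", "k2"]]  # stand-in per-head entries
print(keep)                                   # [True, False, True]
print(apply_shared_mask(layer0_cache, keep))  # [['k0', 'k2'], ['k0', 'k2']]
```

Because every head keeps the same positions, the compacted cache stays rectangular and can be handled as one batched tensor operation instead of a ragged per-head structure.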
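MAD thresholding. Keep positions whose score exceeds a robust outlier threshold such as median + tau * MAD, where MAD is the median absolute deviation of the scores. A higher tau raises the threshold, so fewer positions survive and compaction is more aggressive. Optimal settings depend on task regime, trajectory quality, and document length.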
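The threshold rule can be sketched in a few lines. The function name `mad_keep_mask` and the sample scores are hypothetical; the strict `>` comparison is also an assumption.

```python
from statistics import median


def mad_keep_mask(scores, tau):
    """True for positions whose score exceeds median + tau * MAD."""
    med = median(scores)
    mad = median(abs(s - med) for s in scores)
    threshold = med + tau * mad
    return [s > threshold for s in scores]


scores = [0.01, 0.02, 0.02, 0.03, 0.40, 0.52]
print(mad_keep_mask(scores, tau=2.0))   # keeps the two high-scoring positions
print(mad_keep_mask(scores, tau=45.0))  # keeps only the strongest position
```

Unlike a fixed top-k, the number of retained positions adapts to the score distribution. One edge case worth checking on real data: if more than half the scores are identical, MAD is zero and the threshold collapses to the median.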
Latent Briefing is only practical when the system controls the worker inference runtime closely enough to inspect or transform KV state. It is a poor default for API-only stacks where internal KV tensors are inaccessible. It also assumes the orchestrator trajectory can be represented in the worker's model space. If orchestrator and worker differ materially in tokenizer, architecture, or attention layout, direct representation sharing may not be viable.
Choose the mechanism that matches the bottleneck:
| Need | Prefer | Why |
|------|--------|-----|
| Stable repeated prefix with minimal logic changes | Prefix caching | Cheapest optimization; no information loss |
| Human-readable and auditable cross-step state | Structured notes or summarization | Easy to inspect and store |
| Sparse lookup across a large external corpus | Retrieval / RAG | Finds documents efficiently |
| Worker needs task-specific slices of orchestrator state and runtime access exists | Latent Briefing | Transfers relevant latent state without replaying all text |
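The table can be read as a small routing function. This is only a sketch: the function name, the flag names, the returned labels, and the precedence order are all assumptions, and the fallback to a text handoff follows the API-only guidance rather than any rule stated in the table.

```python
def choose_handoff(*, stable_prefix: bool, needs_audit_trail: bool,
                   corpus_lookup: bool, needs_task_state: bool,
                   kv_access: bool) -> str:
    """Map the table's rows to a mechanism (precedence order is an assumption)."""
    if needs_task_state and kv_access:
        return "latent-briefing"
    if corpus_lookup:
        return "retrieval"
    if needs_audit_trail:
        return "structured-notes-or-summarization"
    if stable_prefix:
        return "prefix-caching"
    # No runtime access to KV state: fall back to a text handoff.
    return "structured-notes-or-summarization"


print(choose_handoff(stable_prefix=False, needs_audit_trail=False,
                     corpus_lookup=False, needs_task_state=True, kv_access=True))
print(choose_handoff(stable_prefix=False, needs_audit_trail=False,
                     corpus_lookup=False, needs_task_state=True, kv_access=False))
```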
Latent Briefing is not a universal replacement for summarization or retrieval. It is a specialized optimization for systems that already run a controllable orchestrator-worker stack.
Reported long-document QA results suggest:
These are tuning hypotheses, not portable laws. Re-measure on the target workload.
Scenario: orchestrator trajectory grows across worker calls
Call 1: trajectory T1 -> worker answers subquestion A
Call 2: trajectory T2 = T1 + new reasoning + reply A
        compact KV(T2) using the task prompt for B
        worker answers subquestion B
The task prompt for B decides which parts of T2 survive into the compacted worker state.
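A toy run of this scenario shows how the task prompt changes what survives. Everything here is hypothetical: the labels, the two-dimensional key vectors (dimension 0 loosely stands for revenue content, dimension 1 for contract content), and the single-head scoring. Real compaction operates on per-layer, per-head KV tensors.

```python
import math
from statistics import median

# Hypothetical trajectory T2 as (label, toy key vector) pairs.
T2 = [
    ("T1: revenue figures in section 3", [2.0, 0.0]),
    ("T1: dead end on footnotes", [0.1, 0.2]),
    ("reply A: contract renewal terms", [0.0, 2.0]),
    ("new reasoning: compare revenue to guidance", [1.8, 0.1]),
    ("tool output noise", [0.2, 0.1]),
    ("new reasoning: unrelated aside", [0.0, 0.3]),
]


def surviving(trajectory, task_query, tau=3.0):
    """Labels of positions whose attention score exceeds median + tau * MAD."""
    logits = [sum(q * k for q, k in zip(task_query, key)) for _, key in trajectory]
    peak = max(logits)
    weights = [math.exp(x - peak) for x in logits]
    total = sum(weights)
    scores = [w / total for w in weights]
    med = median(scores)
    mad = median(abs(s - med) for s in scores)
    return [label for (label, _), s in zip(trajectory, scores) if s > med + tau * mad]


print(surviving(T2, [3.0, 0.0]))  # task B asks about revenue
print(surviving(T2, [0.0, 3.0]))  # a different task asks about the contract
```

With the revenue-oriented query, only the two revenue-related positions survive; with the contract-oriented query, only the reply about renewal terms does. The trajectory is identical in both cases, so the difference comes entirely from the task prompt.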
Negative example: API-only worker
If the worker runs behind a hosted text-generation API that does not expose KV tensors, Latent Briefing cannot be implemented directly. Use a structured text handoff from context-compression or retrieve state from memory-systems instead.
A single tau rarely works across long vs. short contexts and easy vs. hard tasks. Expect accuracy cliffs when compaction becomes too aggressive.

Internal reference:
Related skills in this collection:
External resources:
Created: 2026-04-14
Last Updated: 2026-05-15
Author: Agent Skills for Context Engineering Contributors; primary technical source Ramp Labs (public post)
Version: 1.2.0