skills/always-on-agent-architecture/SKILL.md
Architecture and systems design for building always-on AI agents with episodic memory. Covers the memory hierarchy (core/recall/archival), persistence layers, agent server infrastructure, vector stores, and framework selection. Provides concrete deployment patterns for agents that maintain identity and learn across sessions. Activate on: "always-on agent", "persistent agent architecture", "episodic memory system", "agent memory design", "long-running agent", "stateful agent", "agent that remembers", "MemGPT architecture", "Letta deployment", "/always-on-agent-architecture". NOT for: choosing what data to feed the agent (use always-on-agent-inputs), brainstorming applications (use always-on-agent-applications), safety and privacy concerns (use always-on-agent-safety), general agentic patterns (use agentic-patterns).
You are designing the architecture for an always-on AI agent with episodic memory. This is not a chatbot with a long context window. This is a system that persists state across sessions, manages its own memory hierarchy, runs as a service, and maintains identity over weeks and months. The core insight: treat the LLM as a CPU that operates on managed memory, not as a stateless function.
Q1: Do you want a full agent runtime (server, APIs, tools)?
├─ Yes → Use Letta (most complete, production-ready)
└─ No, I have my own agent loop
    ├─ Q2: Do you need temporal/relationship tracking?
    │   ├─ Yes → Use Zep/Graphiti (best temporal knowledge graph)
    │   └─ No → Go to Q3
    │       ├─ Q3: Do you need graph + vector hybrid?
    │       │   ├─ Yes → Use Mem0 (graph mode)
    │       │   └─ No → Go to Q4
    │       │       ├─ Q4: Already on LangGraph?
    │       │       │   ├─ Yes → Use LangMem
    │       │       │   └─ No → Use pgvector or Chroma
    └─ Want zero dependencies? → Custom SQLite + local embeddings
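
The decision tree above can be expressed as a small function, which is handy for documenting a choice in code review. This is a minimal sketch: the `Needs` dataclass, its field names, and `choose_framework` are hypothetical, and checking `zero_dependencies` right after Q1 is an assumption, since the tree does not say where that branch ranks relative to Q2 through Q4.

```python
from dataclasses import dataclass


@dataclass
class Needs:
    """Answers to the decision-tree questions (hypothetical field names)."""
    full_runtime: bool = False         # Q1
    temporal_tracking: bool = False    # Q2
    graph_vector_hybrid: bool = False  # Q3
    on_langgraph: bool = False         # Q4
    zero_dependencies: bool = False    # side branch of "own agent loop"


def choose_framework(n: Needs) -> str:
    """Walk the tree top to bottom and return the first match."""
    if n.full_runtime:
        return "Letta"
    # Assumption: the zero-dependency branch takes priority over Q2-Q4.
    if n.zero_dependencies:
        return "Custom SQLite + local embeddings"
    if n.temporal_tracking:
        return "Zep/Graphiti"
    if n.graph_vector_hybrid:
        return "Mem0 (graph mode)"
    if n.on_langgraph:
        return "LangMem"
    return "pgvector or Chroma"
```

For example, `choose_framework(Needs(on_langgraph=True))` follows the Q1 "No", Q2 "No", Q3 "No" path and lands on the Q4 "Yes" branch.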
| Trigger | Threshold | Action |
|---------|-----------|--------|
| Size Overflow | Core memory > 4KB | Summarize least-recent block, move summary to archival |
| Age Decay | Data unused > 30 days | Mark for compaction review |
| Relevance Drop | Access score < 0.3 | Move to archival memory with decay tag |
| User Override | User says "forget X" | Immediate removal + archival tombstone |
| Conflict Detection | Contradictory facts stored | Prompt agent to reconcile or ask user |
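
The three numeric triggers in the table can be checked mechanically. The sketch below assumes a hypothetical `MemoryItem` record and hypothetical action names. It covers only Size Overflow, Age Decay, and Relevance Drop, because User Override and Conflict Detection depend on language understanding rather than thresholds.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

CORE_LIMIT_BYTES = 4 * 1024       # Size Overflow threshold from the table
MAX_IDLE = timedelta(days=30)     # Age Decay threshold
MIN_ACCESS_SCORE = 0.3            # Relevance Drop threshold


@dataclass
class MemoryItem:
    """Hypothetical record for one stored memory."""
    text: str
    last_accessed: datetime
    access_score: float


def compaction_actions(core_bytes: int, item: MemoryItem, now: datetime) -> list[str]:
    """Return the actions (hypothetical names) triggered for this item."""
    actions = []
    if core_bytes > CORE_LIMIT_BYTES:
        actions.append("summarize_least_recent_block")
    if now - item.last_accessed > MAX_IDLE:
        actions.append("mark_for_compaction_review")
    if item.access_score < MIN_ACCESS_SCORE:
        actions.append("move_to_archival_with_decay_tag")
    return actions
```

Passing `now` explicitly, rather than calling `datetime.now()` inside the function, keeps the check deterministic and easy to test.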
If query_latency_requirement < 10ms AND data_size > 100M vectors:
→ Use Qdrant (optimized for speed)
Else if already_using_postgresql:
→ Use pgvector (single DB, simpler ops)
Else if need_hybrid_search (keyword + semantic):
→ Use Weaviate (best hybrid)
Else if zero_ops_preferred:
→ Use Pinecone (fully managed)
Else:
→ Use Chroma (local-first, simple API)
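
The same selection rules can be written as runnable Python. This is a sketch only: the function and parameter names are hypothetical, and the branch order mirrors the pseudocode above.

```python
def choose_vector_store(
    latency_ms_required: float,
    vector_count: int,
    uses_postgres: bool = False,
    needs_hybrid_search: bool = False,
    zero_ops_preferred: bool = False,
) -> str:
    """Apply the selection rules in the order they are listed."""
    if latency_ms_required < 10 and vector_count > 100_000_000:
        return "Qdrant"
    if uses_postgres:
        return "pgvector"
    if needs_hybrid_search:
        return "Weaviate"
    if zero_ops_preferred:
        return "Pinecone"
    return "Chroma"
```

Because the rules are evaluated in order, a team already on PostgreSQL gets pgvector even if it also wants hybrid search; reorder the branches if your priorities differ.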
Input: User message or agent observation
│
├─ Contains identity/preference update?
│ └─ Yes → Update core memory, persist immediately
├─ Requires conversation context?
│ └─ Yes → Search recall memory (conversation history)
├─ Needs factual knowledge?
│ └─ Yes → Search archival memory (vector store)
└─ External data needed?
  └─ Yes → Use external tools (APIs, files, etc.)
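
The routing tree above can be sketched as a function that returns every memory tier an input should touch. The `Tier` enum and `route` function are hypothetical, and the four boolean flags are assumed to come from an upstream classifier (for example, an LLM call) that the tree itself does not specify. The branches are treated as non-exclusive, so one input can hit several tiers.

```python
from enum import Enum


class Tier(Enum):
    CORE = "core"          # identity and preferences, persisted immediately
    RECALL = "recall"      # conversation history
    ARCHIVAL = "archival"  # vector store
    TOOLS = "tools"        # external APIs, files, etc.


def route(
    updates_identity: bool,
    needs_conversation_context: bool,
    needs_facts: bool,
    needs_external_data: bool,
) -> list[Tier]:
    """Return the tiers to consult, in the order the tree checks them."""
    tiers = []
    if updates_identity:
        tiers.append(Tier.CORE)
    if needs_conversation_context:
        tiers.append(Tier.RECALL)
    if needs_facts:
        tiers.append(Tier.ARCHIVAL)
    if needs_external_data:
        tiers.append(Tier.TOOLS)
    return tiers
```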
Symptoms: Agent personality drift, contradictory responses, core memory conflicts
Root Cause: Concurrent writes to core memory without locking, or failed partial updates
Detection Rule: If core memory size suddenly drops >50% or contains malformed JSON/YAML
Recovery Procedure:
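
The detection rule for this failure mode (not the recovery procedure) could be checked as follows. This is a minimal sketch that assumes core memory is serialized as JSON; a YAML check would need a third-party parser and is omitted. The function name is hypothetical.

```python
import json


def core_memory_looks_corrupt(previous_size_bytes: int, raw: str) -> bool:
    """Flag a >50% size drop or a payload that no longer parses as JSON."""
    current_size = len(raw.encode("utf-8"))
    if previous_size_bytes > 0 and current_size < 0.5 * previous_size_bytes:
        return True
    try:
        json.loads(raw)
    except json.JSONDecodeError:
        return True
    return False
```

Running this check before each write, against the size recorded at the last successful write, catches truncated partial updates before they overwrite a good copy.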
Symptoms: Increasingly irrelevant search results, agent can't find recently stored facts
Root Cause: Embedding model drift, index corruption, or no memory compaction
Detection Rule: If average cosine similarity of top-3 results < 0.7 for known queries
Recovery Procedure:
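
The detection rule for this failure mode can be sketched in plain Python. The function names are hypothetical, and the example assumes you keep a small set of known queries whose embeddings you can compare directly against stored vectors; a real deployment would more likely read similarity scores back from the vector store.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieval_degraded(
    query_vec: list[float],
    stored_vecs: list[list[float]],
    k: int = 3,
    threshold: float = 0.7,
) -> bool:
    """True when the mean similarity of the top-k matches is below threshold."""
    top = sorted((cosine(query_vec, v) for v in stored_vecs), reverse=True)[:k]
    if not top:
        return True  # nothing retrievable counts as degraded
    return sum(top) / len(top) < threshold
```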
Symptoms: Agent hangs on memory operations, database connection timeouts
Root Cause: Simultaneous read/write to same memory blocks, insufficient connection pooling
Detection Rule: If memory operation takes >30s or database shows lock wait timeouts
Recovery Procedure:
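
The 30-second half of the detection rule above can be enforced with a timeout wrapper around each memory operation. This sketch uses the standard library's `asyncio.wait_for`; the wrapper name `guarded` is hypothetical, and database lock-wait timeouts would need to be read from the database itself, which is not shown.

```python
import asyncio

MEMORY_OP_TIMEOUT_S = 30.0  # threshold from the detection rule


async def guarded(op, timeout: float = MEMORY_OP_TIMEOUT_S):
    """Await a memory operation; return (ok, result) instead of hanging."""
    try:
        return True, await asyncio.wait_for(op, timeout)
    except asyncio.TimeoutError:
        # The operation exceeded the threshold: report it to the caller.
        return False, None
```

Returning a flag rather than raising lets the agent loop log the incident and continue with degraded memory instead of blocking the whole turn.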
Symptoms: API costs spike, response latency increases, token limit errors
Root Cause: Core memory bloat, retrieving too many archival chunks per query
Detection Rule: If average tokens per request > 80% of model's context limit
Recovery Procedure:
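
The detection rule for this failure mode reduces to a running average compared against a fraction of the context window. A minimal sketch, with a hypothetical function name and the assumption that per-request token counts are already being logged:

```python
def over_token_budget(
    request_token_counts: list[int],
    context_limit: int,
    ratio: float = 0.8,
) -> bool:
    """True when average tokens per request exceed ratio * context_limit."""
    if not request_token_counts:
        return False
    average = sum(request_token_counts) / len(request_token_counts)
    return average > ratio * context_limit
```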
Symptoms: Database size grows linearly, search performance degrades over time
Root Cause: No memory compaction, duplicate fact insertion, missing garbage collection
Detection Rule: If total memory size grows >100MB/month with normal usage
Recovery Procedure:
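
The growth-rate detection rule above can be estimated from two size samples. This sketch assumes you record total memory size in bytes at known dates, treats a month as 30 days, and takes 1 MB as 1024 × 1024 bytes; the function names are hypothetical.

```python
BYTES_PER_MB = 1024 * 1024


def monthly_growth_mb(start_bytes: int, end_bytes: int, days_elapsed: float) -> float:
    """Scale observed growth to a 30-day month."""
    if days_elapsed <= 0:
        raise ValueError("days_elapsed must be positive")
    return (end_bytes - start_bytes) / BYTES_PER_MB * (30 / days_elapsed)


def growth_excessive(start_bytes: int, end_bytes: int, days_elapsed: float,
                     limit_mb: float = 100.0) -> bool:
    """True when projected monthly growth exceeds the limit."""
    return monthly_growth_mb(start_bytes, end_bytes, days_elapsed) > limit_mb
```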
Scenario: Design architecture for an agent that helps with technical research, remembers your preferences, and builds knowledge over months.
Step 1 - Memory Tier Design
Core Memory (2KB):
- User name: "Sarah"
- Research domains: ["machine learning", "distributed systems"]
- Preferred paper sources: ["arxiv", "acm digital library"]
- Writing style: "detailed with code examples"
- Current project: "distributed training optimization"
Recall Memory:
- All conversations in PostgreSQL with full-text search
- 30-day retention window, then summarized
Archival Memory:
- Paper summaries, extracted insights, code snippets
- pgvector on PostgreSQL (already using it for recall)
- nomic-embed-text for local embedding (privacy + cost)
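
As a quick check on the tier design above, the core memory block can be serialized and measured against its 2KB budget. This is a sketch under assumptions: the key names and the choice of JSON as the storage format are illustrative and not prescribed by the design.

```python
import json

CORE_BUDGET_BYTES = 2 * 1024  # "Core Memory (2KB)" from the design above

# Key names are hypothetical; values come from the scenario.
core_memory = {
    "user_name": "Sarah",
    "research_domains": ["machine learning", "distributed systems"],
    "preferred_paper_sources": ["arxiv", "acm digital library"],
    "writing_style": "detailed with code examples",
    "current_project": "distributed training optimization",
}


def core_size_bytes(core: dict) -> int:
    """Size of the core block as it would be stored (UTF-8 encoded JSON)."""
    return len(json.dumps(core).encode("utf-8"))


def fits_budget(core: dict, budget: int = CORE_BUDGET_BYTES) -> bool:
    return core_size_bytes(core) <= budget
```

The block in this scenario is well under budget, which leaves room for the agent to add preferences over time before compaction is needed.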
Step 2 - Framework Selection Decision
Following the decision tree:
Step 3 - Agent Loop Implementation
async def research_step(user_query: str):
    """Run one agent turn: load memory, retrieve context, respond, persist.

    load_core_memory, search_archival, save_to_recall, and llm are assumed
    to be provided by the surrounding application.
    """
    # Load core memory
    core = load_core_memory()  # User prefs, active project

    # Check if query relates to current project
    if "optimization" in user_query.lower():
        # Search archival for project-specific knowledge
        relevant_papers = search_archival("distributed training optimization")
        context = f"Current project context: {relevant_papers}"
    else:
        # Search for general domain knowledge
        context = search_archival(user_query)

    # Build prompt with core memory + retrieved context
    system_prompt = f"""
You are Sarah's research assistant.
User preferences: {core['preferences']}
Current project: {core['current_project']}
Retrieved context: {context}
"""
    response = await llm.chat([
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_query},
    ])

    # Persist interaction
    save_to_recall(user_query, response)
    return response
What a novice would miss:
What an expert catches:
Do NOT use this skill for:
- Choosing what data to feed the agent: use /always-on-agent-inputs instead. That skill covers what data to feed the agent; this one covers how to store and retrieve it.
- Brainstorming applications: use /always-on-agent-applications instead. That skill covers use case ideation; this one covers technical implementation.
- Safety and privacy concerns: use /always-on-agent-safety instead. That skill covers data governance, consent, and security; this one assumes those are already designed.
- General agentic patterns: use /agentic-patterns instead. That skill covers ReAct loops, tool use, and planning; this one covers the persistence layer underneath.
- Agents without cross-session memory: use /agent-creator instead. If the agent doesn't need to remember across sessions, you don't need always-on architecture.