
Universal LaTeX document skill: create, compile, and convert any document to professional PDF with PNG previews. Supports resumes, reports, cover letters, invoices, academic papers, theses/dissertations, academic CVs, presentations (Beamer), scientific posters, formal letters, exams/quizzes, books, cheat sheets, reference cards, exam formula sheets, fillable PDF forms (hyperref form fields), conditional content (etoolbox toggles), mail merge from CSV/JSON (Jinja2 templates), version diffing (latexdiff), charts (pgfplots + matplotlib), tables (booktabs + CSV import), images (TikZ), Mermaid diagrams, AI-generated images, watermarks, landscape pages, bibliography/citations (BibTeX/biblatex), multi-language/CJK (auto XeLaTeX), algorithms/pseudocode, colored boxes (tcolorbox), SI units (siunitx), Pandoc format conversion (Markdown/DOCX/HTML ↔ LaTeX), and PDF-to-LaTeX conversion of handwritten or printed documents (math, business, legal, general). Compile script supports pdflatex, xelatex, lualatex with auto-detection, latexmk backend, texfot log filtering, PDF/A output, and verbosity control (--verbose/--quiet). Empirically optimized scaling: single agent 1-10 pages, split 11-20, batch-7 pipeline 21+. 
Use when user asks to: (1) create a resume/CV/cover letter, (2) write a LaTeX document, (3) create PDF with tables/charts/images, (4) compile a .tex file, (5) make a report/invoice/presentation, (6) anything involving LaTeX or pdflatex, (7) convert/OCR a PDF to LaTeX, (8) convert handwritten notes, (9) create charts/graphs/diagrams, (10) create slides, (11) write a thesis or dissertation, (12) create an academic CV, (13) create a poster, (14) create an exam/quiz, (15) create a book, (16) convert between document formats (Markdown, DOCX, HTML to/from LaTeX), (17) generate Mermaid diagrams for LaTeX, (18) create a formal business letter, (19) create a cheat sheet or reference card, (20) create an exam formula sheet or crib sheet, (21) condense lecture notes/PDFs into a cheat sheet, (22) create a fillable PDF form with text fields/checkboxes/dropdowns, (23) create a document with conditional content/toggles (show/hide sections), (24) generate batch/mail-merge documents from CSV/JSON data, (25) create a version diff PDF (latexdiff) highlighting changes between documents, (26) create a homework or assignment submission with problems and solutions, (27) create a lab report with data tables, graphs, and error analysis, (28) encrypt or password-protect a PDF, (29) merge multiple PDFs into one, (30) optimize/compress a PDF for web or email, (31) lint or check a LaTeX document for common issues, (32) count words in a LaTeX document, (33) analyze document statistics (figures, tables, citations), (34) fetch BibTeX from a DOI, (35) convert a Graphviz .dot file to PDF/PNG, (36) convert a PlantUML .puml file to PDF/PNG, (37) create a one-pager/fact sheet/executive summary, (38) create a datasheet or product specification sheet, (39) extract pages from a PDF (page ranges, odd/even), (40) check LaTeX package availability before compiling, (41) analyze citations and cross-reference with .bib files, (42) debug LaTeX compilation errors, (43) make a document accessible (PDF/A, tagged PDF), (44) create lecture notes or course handouts, (45) fill an existing PDF form (fillable fields or non-fillable with annotations), (46) extract text or tables from a PDF (pdfplumber, pypdf), (47) OCR a scanned PDF to text (pytesseract), (48) create a PDF programmatically with reportlab (Canvas, Platypus), (49) rotate or crop PDF pages (pypdf), (50) add a watermark to an existing PDF, (51) extract metadata from a PDF (title, author, subject).
Generate fine-grained evaluation rubrics for medical dialogue systems using a retrieval-augmented multi-agent pipeline. Decomposes medical evidence into atomic facts, synthesizes them with interaction constraints, and produces weighted, auditable rubrics. Use when: 'evaluate medical chatbot responses', 'generate rubrics for clinical QA', 'build a medical LLM evaluation pipeline', 'score health dialogue quality', 'create automated clinical evaluation criteria', 'refine medical AI responses with rubric feedback'.
Compress long user contexts (profiles, histories, documents) into concise, high-quality summaries using attention-guided importance marking. Use when asked to: 'compress this user profile', 'shorten this context for the prompt', 'reduce token usage for personalization', 'summarize interaction history preserving key signals', 'fit this long context into a token budget', 'extract the most relevant parts of this document for a query'.
Design and execute long-horizon, cross-application agent workflows for professional knowledge work (finance, consulting, legal). Applies the APEX-Agents benchmark methodology to structure multi-step tasks that span files, spreadsheets, documents, email, calendars, and code execution within realistic work environments. Trigger phrases: "Build an agent workflow for this banking/consulting/legal task", "Create a cross-application task pipeline", "Design a multi-step professional workflow with rubric evaluation", "Set up an Archipelago-style sandboxed agent environment", "Evaluate agent performance on a long-horizon task", "Break this professional task into rubric-graded criteria".
Build agentic recommendation systems that learn collaborative filtering signals through cross-user memory evolution -- no CF model pre-training needed. Use when: 'build a recommender with memory', 'add collaborative filtering to LLM recommendations', 'cross-user pattern memory pool', 'agentic recommender system', 'memory-augmented ranking', 'evolving user behavior patterns for recommendations'.
Build adaptive memory systems using coordinated multi-agent collaboration with hierarchical storage and consistency maintenance. Use when: 'build a memory system for my chatbot', 'add long-term memory to my agent', 'implement multi-granularity retrieval', 'create a memory-augmented LLM pipeline', 'handle memory conflicts in conversational AI', 'reduce context window usage with smart retrieval'.
Build LLM-based multi-agent systems for supply chain inventory management using structured decision prompts and memory-retrieval (AIM-RM). Implements the beer game multi-echelon supply chain simulation with per-stage agents that use stepwise ordering prompts, safety-stock calculations, and Euclidean-distance memory retrieval of similar historical episodes. Use when asked to: "build a supply chain agent", "implement inventory management with LLMs", "create a beer game simulation with AI agents", "multi-agent ordering system", "AIM-RM memory retrieval agent", "supply chain decision prompt design".
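The two numeric pieces named in the supply chain entry above, a safety-stock calculation and Euclidean-distance retrieval of similar past episodes, can be sketched as follows. This is a minimal illustration, not the AIM-RM implementation: the safety-stock formula is the common textbook form (z times demand standard deviation times the square root of lead time), and the episode layout (a `state` tuple plus an `order` value) is a hypothetical choice.

```python
import math

def safety_stock(z: float, demand_std: float, lead_time: float) -> float:
    """Textbook safety stock: z * sigma_demand * sqrt(lead_time) (assumed formula)."""
    return z * demand_std * math.sqrt(lead_time)

def retrieve_similar(state, memory, k=2):
    """Return the k stored episodes whose state vectors are nearest to `state`.

    `memory` is a list of dicts with a numeric `state` tuple (hypothetical layout).
    """
    return sorted(memory, key=lambda ep: math.dist(state, ep["state"]))[:k]

# Hypothetical episodes: state = (inventory, backlog), order = quantity ordered.
memory = [
    {"state": (10, 4), "order": 8},
    {"state": (50, 20), "order": 30},
    {"state": (12, 5), "order": 9},
]
nearest = retrieve_similar((11, 4), memory, k=2)
```

The retrieved episodes would typically be placed in the stage agent's prompt as worked examples before it decides the next order.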
Reverse-engineer black-box agentic systems into editable, interpretable workflows using search-based reconstruction. Use when the user says 'reconstruct this agent workflow', 'reverse-engineer this pipeline', 'white-box this agentic system', 'explain what this agent chain is doing', 'approximate this black-box agent', or 'build an interpretable surrogate for this system'.
Implement the EXPERIGEN agentic framework for automated hypothesis generation and empirical validation on datasets. Uses a Bayesian-optimization-inspired Generator-Experimenter loop to discover statistically significant, novel hypotheses from data. Trigger phrases: 'generate hypotheses from this dataset', 'discover patterns in social data', 'run EXPERIGEN on this data', 'automated hypothesis testing', 'find significant predictors in this dataset', 'data-driven hypothesis discovery'.
Automated ArkTS code repair using retrieval-augmented generation, LLM-based test oracle synthesis, and structured benchmark evaluation for HarmonyOS development. Use when: 'fix this ArkTS error', 'repair HarmonyOS code', 'convert TypeScript to ArkTS', 'ArkTS compilation error', 'debug HarmonyOS component', 'generate tests for ArkTS code'.
Build adaptive, cost-aware Graph-RAG pipelines that route queries through escalating retrieval stages (local -> bridge -> global) with triple-check verification and provenance map-back. Use when: 'build a graph RAG pipeline', 'implement adaptive retrieval for knowledge graphs', 'cost-aware multi-hop question answering', 'add evidence verification to RAG', 'handle mixed-difficulty queries efficiently', 'graph retrieval with source text grounding'.
Execute large-scale breadth-oriented search and retrieval tasks using the A-MapReduce pattern: decompose a wide query into a task matrix, dispatch parallel map agents for independent retrieval, then reduce partial results into a unified structured table. Triggers: 'find all X that match Y across a large set', 'build a comparison table of N items', 'search for every instance of X', 'collect attributes for a list of entities', 'wide search across many sources', 'gather structured data on hundreds of items'.
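The decompose, map, and reduce steps in the entry above can be sketched with a thread pool. This is a rough illustration under stated assumptions: `lookup` stands in for whatever retrieval agent fills one cell, and the function name `a_mapreduce` and the dict-of-dicts table layout are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def a_mapreduce(entities, attributes, lookup, max_workers=4):
    """Fill an entity-by-attribute table by dispatching one lookup per cell."""
    # Decompose: the task matrix is every (entity, attribute) pair.
    cells = [(e, a) for e in entities for a in attributes]
    # Map: run the independent lookups in parallel; pool.map preserves input order.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        values = list(pool.map(lambda cell: lookup(*cell), cells))
    # Reduce: merge the partial results into one structured table.
    table = {e: {} for e in entities}
    for (e, a), value in zip(cells, values):
        table[e][a] = value
    return table
```

Because each cell is independent, a failed lookup can be retried on its own without redoing the rest of the matrix.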
Explain generative AI outputs using the gSMILE perturbation-based attribution framework. Builds local surrogate models from controlled input perturbations and Wasserstein distance to produce token-level or word-level importance scores for LLM and diffusion model outputs. Triggers: 'explain why the model generated this', 'token attribution for prompt', 'which words in my prompt matter most', 'interpret generative model output', 'build explainability for my LLM pipeline', 'debug prompt influence on generation'
Adaptive multi-step tool orchestration for complex reasoning tasks. Dynamically selects, sequences, and composes tools based on task context and intermediate results rather than fixed pipelines. Use when: 'orchestrate tools for this task', 'figure out which tools to use', 'multi-step reasoning with tools', 'adaptive tool pipeline', 'dynamic tool selection', 'chain tools together intelligently'.
Multi-agent code generation using structured debate with adaptive confidence gating. Three specialized agents (User/Product, Technical, QA) debate before generating code, with a 95% confidence threshold to skip debate on simple tasks and a reviewer-guided debugging loop for post-generation refinement. Use when: 'generate code with multi-agent debate', 'use DebateCoder approach', 'code generation with confidence gating', 'multi-agent collaborative coding', 'debate-driven code synthesis', 'adaptive confidence code generation'.
Red-team and harden AI voice agents and LLM-powered service systems against adversarial misuse using the Aegis framework. Evaluates authentication bypass, privacy leakage, privilege escalation, data poisoning, and resource abuse risks. Use when: 'red-team my voice agent', 'security audit my AI call center', 'harden my LLM agent against prompt injection', 'test my chatbot for privilege escalation', 'add layered defenses to my AI service', 'evaluate my agent for data leakage'.
Audit LLM agent systems for trust-boundary security vulnerabilities using the AgentFence taxonomy of 14 attack classes across planning, memory, retrieval, tool use, and delegation. Produces trace-auditable security reports with mean security break rates (MSBR) per attack surface. Use when: "audit my agent for security vulnerabilities", "check agent trust boundaries", "find delegation attacks in my agent code", "map security risks in my LangGraph/CrewAI pipeline", "evaluate agent architecture security", "run AgentFence analysis on this agent system".
Generate structured autonomous driving scenarios and MCQ benchmarks using AgentDrive's factorized 7-axis prompt-to-JSON pipeline. Use when asked to: 'generate driving scenarios for AV testing', 'create autonomous driving benchmarks', 'build LLM evaluation datasets for driving', 'design safety-critical simulation scenarios', 'evaluate LLM reasoning on driving tasks', 'create multiple-choice questions for autonomous systems'.
Agentic secure code review for detecting immature vulnerabilities at pre-commit stage. Uses a two-phase Detector-Validator pipeline with SAST-rule semantic memory and CWE-tree validation to localize, classify, and explain security weaknesses in code diffs. Trigger phrases: "review this diff for security issues", "secure code review", "find vulnerabilities in my changes", "pre-commit security check", "check this PR for security weaknesses", "agentic security review"
Build agentic systems for understanding very long video streams (hours to weeks) using entity scene graphs, multi-tool planning agents, and hybrid cross-modal search. Use when: 'build a video understanding agent', 'analyze long egocentric video', 'entity graph from video', 'multi-hop video QA', 'search across hours of footage', 'temporal reasoning over video'.
Build agentic LLM-driven robotic manipulation pipelines using the ALRM framework pattern: a ReAct-style reasoning loop with dual execution modes (Code-as-Policy for direct code generation, Tool-as-Policy for iterative tool-based execution). Generates modular robot control code with closed-loop planning, observation, and replanning. Trigger phrases: "robot manipulation agent", "agentic robot control", "ReAct robot planner", "code-as-policy generation", "tool-as-policy robot", "LLM robotic pipeline"
Build multi-agent adaptive learning systems that diagnose knowledge gaps and recommend targeted resources. Implements the ALIGNAgent framework: Skill Gap Agent (proficiency estimation + concept-level diagnostic reasoning) and Recommender Agent (preference-aware resource retrieval aligned to deficiencies). Trigger phrases: "Build an adaptive learning system", "Create a personalized tutoring agent", "Diagnose student knowledge gaps from quiz data", "Build a skill gap analyzer for learners", "Create an educational recommender that adapts to student performance", "Implement a multi-agent pipeline for personalized education".
Multi-agent framework for scoring subjective, open-ended responses (interviews, essays, reflections) using transcript refinement + criterion-specific parallel scoring with calibrated few-shot examples. Use when: 'score these interview responses', 'evaluate candidate answers', 'grade these essays on a rubric', 'assess soft skills from text', 'build an automated scoring pipeline', 'rate open-ended responses against criteria'.
Build robust web automation agents using Mixture of Grounding Experts, experience-imitation planning, and task-tracking checklists. Use when: 'build a web agent', 'automate browser tasks with grounding', 'create a web scraping agent with memory', 'implement element grounding for web automation', 'build a multi-step web task agent', 'add procedural knowledge to a browser agent'.
Generate secure code using Stream of Revision, an in-decoding self-correction technique that backtracks and patches vulnerable code spans during generation rather than after it. Trigger phrases: "generate secure code", "fix security vulnerabilities in this code", "write safe C/C++ code", "review and revise code for security", "backtrack and fix this vulnerability", "self-correcting code generation".
Detect reward hacking in AI-generated code trajectories using contrastive analysis from the TRACE benchmark. Use when: 'check this code agent for reward hacking', 'detect if these test results are gamed', 'audit coding agent trajectories', 'find reward exploits in RL-generated code', 'contrastive analysis on code submissions', 'are these tests being manipulated'.
Decouple evidence access from evidence use when evaluating or building long-context and RAG systems under semantic interference. Use this skill when the user says: 'evaluate my RAG pipeline against hard negatives', 'stress-test retrieval with semantic distractors', 'build a decoupled retrieval benchmark', 'diagnose why my long-context QA is failing', 'create collision-tested evaluation data', 'measure evidence access vs answer quality separately'.
Build query-adaptive knowledge graph retrieval systems using CatRAG's context-aware traversal. Transforms static KG-based RAG pipelines into dynamic, query-sensitive retrieval that recovers complete multi-hop evidence chains. Use when: 'build a multi-hop RAG pipeline', 'improve knowledge graph retrieval', 'fix semantic drift in graph search', 'implement context-aware graph traversal', 'retrieve complete evidence chains from a KG', 'add query-dependent edge weighting to my graph'.
Audit and harden Model Context Protocol (MCP) server deployments against protocol-level vulnerabilities including capability attestation bypass, sampling injection, and cross-server trust propagation. Use when: 'audit my MCP server security', 'harden MCP deployment', 'check for prompt injection in MCP tools', 'add authentication to MCP messages', 'secure multi-server MCP setup', 'analyze MCP attack surface'.
Implement C²RoPE (Causal Continuous Rotary Positional Encoding) for multimodal transformers that process 2D/3D visual data alongside text. Replaces standard 1D RoPE with a triplet (m, x, y) positional index and Chebyshev causal masking to preserve spatial locality in vision-language models. Trigger phrases: "implement C2RoPE positional encoding", "fix spatial locality loss in vision-language RoPE", "add 2D-aware rotary embeddings for image tokens", "implement Chebyshev causal masking for visual attention", "modify RoPE for multimodal 3D reasoning", "spatially-aware positional encoding for multi-view images".
LLM-driven data preparation pipeline for cleaning, integrating, and enriching messy datasets. Use when the user says 'clean this data', 'fix this CSV', 'match these schemas', 'deduplicate these records', 'impute missing values', or 'annotate this table'.
Perform rigorous causal inference tasks using structured reasoning pipelines inspired by CauGym. Estimate treatment effects (ATE, CDE, ETT, NDE, NIE), compute probabilities of necessity/sufficiency, apply the backdoor criterion for deconfounding, and build causal DAGs from domain knowledge. Trigger phrases: 'estimate causal effect', 'what is the treatment effect of', 'causal reasoning over this data', 'apply backdoor adjustment', 'counterfactual analysis', 'build a causal DAG for this problem'.
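As an illustration of the backdoor adjustment mentioned in the entry above, the sketch below estimates an average treatment effect by stratifying on a single discrete confounder Z: the per-stratum difference in mean outcome between treated and untreated units is weighted by P(Z = z). It assumes binary treatment, that Z satisfies the backdoor criterion, and that every stratum contains both treated and untreated rows (positivity); the function name and row layout are hypothetical.

```python
from collections import defaultdict

def backdoor_ate(rows):
    """Adjusted ATE from rows of (z, t, y), with t in {0, 1}.

    Computes sum over z of (E[Y | T=1, Z=z] - E[Y | T=0, Z=z]) * P(Z=z).
    Assumes each stratum z has at least one treated and one untreated row.
    """
    rows = list(rows)
    sums = defaultdict(lambda: [0.0, 0])  # (z, t) -> [sum of y, row count]
    z_counts = defaultdict(int)           # z -> row count
    for z, t, y in rows:
        sums[(z, t)][0] += y
        sums[(z, t)][1] += 1
        z_counts[z] += 1
    ate = 0.0
    for z, n_z in z_counts.items():
        mean_treated = sums[(z, 1)][0] / sums[(z, 1)][1]
        mean_control = sums[(z, 0)][0] / sums[(z, 0)][1]
        ate += (mean_treated - mean_control) * n_z / len(rows)
    return ate
```

A naive difference in means over the pooled rows would differ from this estimate whenever treatment assignment varies across strata, which is the confounding the adjustment is meant to remove.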
Solve complex problems by switching between four cognitive mindsets (Spatial, Convergent, Divergent, Algorithmic) at each reasoning step, adapting the thinking mode to what the current sub-problem actually needs. Use when asked to: 'reason through this step by step', 'solve this hard problem', 'think carefully about this', 'use chain of mindset', 'adaptive reasoning', 'multi-step problem solving'.
Diagnose and correct causal reasoning failures in LLM outputs using the CausalT5K framework. Detects rung collapse (answering causal questions with mere correlations), sycophantic drift (abandoning correct answers under pressure), and generates Wise Refusals that specify missing evidence. Trigger phrases: 'diagnose causal reasoning', 'check for rung collapse', 'detect sycophancy in reasoning', 'wise refusal analysis', 'causal trap detection', 'audit causal claims in this output'
Apply introspective visual thinking by iteratively 'chatting with images' — using language-guided re-examination of visual content to reason over fine-grained details, spatial relationships, and multi-image comparisons. Use when: 'analyze this image in detail', 'compare these images', 'reason about spatial layout', 'what's different between these screenshots', 'explain the visual relationship', 'trace the visual logic step by step'.
Build systems that detect and fix reasoning gaps in LLM agents by comparing their chain-of-thought against reference reasoning, extracting structured discrepancies, and generating corrective instructions stored in a retrievable knowledge base. Use when: 'build a reasoning improvement pipeline', 'detect logic gaps in agent output', 'compare agent reasoning to expert reasoning', 'create a corrective knowledge base from reasoning errors', 'improve clinical decision support accuracy', 'patch reasoning with RAG-retrieved instructions'.
Build semantic component interfaces that expose UI components as structured tool primitives for AI agent automation. Use when: 'make my UI agent-friendly', 'add CI4A interfaces to components', 'create semantic wrappers for web components', 'build agent-accessible UI toolkit', 'expose component actions as tools', 'wrap Ant Design components for agent use'.
Assess LLM-generated code correctness using attribution graph analysis inspired by mechanistic interpretability. Apply structural reasoning diagnostics to identify buggy logic, predict failure modes, and suggest targeted fixes. Use when: 'analyze this code for correctness', 'why might this generated code be wrong', 'find structural bugs in this function', 'verify this algorithm logic', 'diagnose why this code fails', 'check this code for logical errors without running it'.
Build fast similarity-search systems using compact binary hypercube embeddings derived from foundation model encoders. Replaces brute-force cosine similarity over float vectors with Hamming distance over binary codes for orders-of-magnitude speedup and memory reduction. Trigger phrases: 'binary hashing for retrieval', 'fast embedding search', 'compact embeddings for similarity', 'Hamming space retrieval', 'hash-based vector search', 'reduce embedding memory footprint'
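The core idea in the entry above, replacing float cosine similarity with Hamming distance over binary codes, can be sketched with NumPy. The binarization used here is plain sign thresholding, which is an assumption for illustration; the skill's actual hypercube embedding may be learned differently. Function names are hypothetical.

```python
import numpy as np

def binarize(embeddings: np.ndarray) -> np.ndarray:
    """Sign-threshold a 2D float array into packed uint8 binary codes (assumed scheme)."""
    bits = (embeddings > 0).astype(np.uint8)
    return np.packbits(bits, axis=1)

def hamming_search(query_code: np.ndarray, db_codes: np.ndarray, k: int = 3) -> np.ndarray:
    """Return indices of the k database codes nearest to the query in Hamming distance."""
    xor = np.bitwise_xor(db_codes, query_code)       # set bits mark disagreements
    dists = np.unpackbits(xor, axis=1).sum(axis=1)   # popcount per database row
    return np.argsort(dists, kind="stable")[:k]

db = np.array([
    [1, 1, 1, 1, 1, 1, 1, 1],
    [-1, -1, -1, -1, -1, -1, -1, -1],
    [1, 1, 1, 1, -1, -1, -1, -1],
], dtype=np.float32)
query = np.array([[1, 1, 1, 1, 1, 1, 1, -1]], dtype=np.float32)

codes = binarize(db)
top2 = hamming_search(binarize(query)[0], codes, k=2)
```

Each 8-dimensional float vector here packs into a single byte, which is where the memory reduction comes from; the XOR-and-popcount comparison is what stands in for the dot product.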
Assess patch correctness using contrastive learning on code representations. Applies semantic-preserving code transformations and multi-view embedding comparison to determine whether a code patch is genuinely correct or merely overfitting to test suites. Trigger phrases: 'is this patch correct', 'check patch correctness', 'assess this fix', 'validate this bug fix', 'detect overfitting patch', 'contrastive patch assessment'
Generate and evaluate visual metonymy -- indirect visual representations that evoke concepts through associated cues rather than literal depiction. Uses a semiotic-theory-grounded pipeline (representamen generation, chain-of-thought visual description, image synthesis) to create images where meaning is implied, not shown. Trigger phrases: 'generate visual metonymy', 'create indirect visual representation', 'visual metaphor pipeline', 'metonymic image generation', 'semiotic image prompt', 'evoke concept visually without showing it'.
Build multi-agent workflows structured as constrained DAG process maps with Monte Carlo uncertainty estimation. Each agent occupies a specialized review role, with predefined escalation paths and terminal states (automated label or human review). Use when: 'build a multi-agent compliance pipeline', 'add uncertainty-aware escalation to my agent workflow', 'create a review chain with human fallback', 'design a DAG-based agent workflow with confidence thresholds', 'implement Monte Carlo sampling for agent decisions', 'build a multi-stage content moderation system'.
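One way to read the Monte Carlo uncertainty step in the entry above: sample an agent's decision several times, treat the share of the majority label as a confidence score, and route to one of the two terminal states. The sketch below assumes that interpretation; the `agent(item, rng)` callable, the 0.8 threshold, and the state names are hypothetical.

```python
import random
from collections import Counter

def mc_decide(agent, item, n_samples=20, threshold=0.8, seed=0):
    """Sample the agent n_samples times and pick a terminal state.

    Returns (state, majority_label, confidence), where state is
    "automated_label" if agreement meets the threshold, else "human_review".
    """
    rng = random.Random(seed)  # passed to the agent so runs are reproducible
    votes = Counter(agent(item, rng) for _ in range(n_samples))
    label, count = votes.most_common(1)[0]
    confidence = count / n_samples
    if confidence >= threshold:
        return ("automated_label", label, confidence)
    return ("human_review", label, confidence)
```

In a multi-stage map, the "human_review" branch could instead forward the item to the next specialized reviewer, with human review reserved for the last node.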
Perform context-sensitive pointer analysis for ArkTS/TypeScript code targeting OpenHarmony. Build precise call graphs, resolve indirect calls through closures and framework APIs, and detect vulnerability patterns. Use when the user asks to 'analyze ArkTS pointer flow', 'build a call graph for OpenHarmony app', 'resolve indirect calls in TypeScript', 'find data flow vulnerabilities in ArkTS', 'model ArkUI component state propagation', or 'reduce false positives in static analysis'.
Mitigate contextual drag, the phenomenon where failed attempts in conversation context bias LLM reasoning toward structurally similar errors. Apply context-denoising and fallback-reasoning strategies when iterating on broken code, debugging multi-step failures, or refining solutions that keep failing in similar ways. Trigger phrases: "I keep getting the same kind of error", "My fix attempt made it worse", "The refactored code has the same bug pattern", "Each iteration introduces a similar failure", "Self-correction loop isn't converging", "Why does my retry keep failing the same way".
Implement CoPE (Clipped RoPE) soft clipping of low-frequency rotary positional embedding components to extend LLM context length without retraining. Use when: 'extend context window with CoPE', 'apply soft clipping to RoPE', 'fix long context degradation', 'implement CoPE positional embedding', 'scale RoPE to longer sequences', 'add cosine-decay frequency clipping'
Apply CRAFT (Calibrated Reasoning with Answer-Faithful Traces) for multi-hop question answering with verified reasoning chains. Use when: 'answer a complex question from multiple documents', 'build a faithful multi-hop QA pipeline', 'reason over retrieved passages with citations', 'verify reasoning chain faithfulness', 'structured RAG with auditable traces', 'multi-step question decomposition with evidence grounding'
Build multi-agent collaboration systems with experience-driven co-evolution using HTN skill libraries and post-episode optimization. Use when: 'build a multi-agent system that improves over episodes', 'create agents that coordinate in real-time with low latency', 'design a skill library for collaborative agents', 'implement experience-based co-evolution for agent teams', 'optimize multi-agent token budget while improving performance', 'set up HTN-based task decomposition for cooperative agents'.
Compress chain-of-thought reasoning using CtrlCoT's dual-granularity framework: hierarchical semantic abstraction combined with logic-preserving token pruning. Use when asked to 'compress reasoning', 'shorten chain of thought', 'optimize CoT tokens', 'reduce reasoning verbosity', 'budget-controlled reasoning', or 'efficient step-by-step thinking'.
Evolutionary multi-agent code optimization using genetic algorithms. Agents mutate each other's training/configuration code, benchmark results, and select survivors across generations. Use when: 'evolve my training config', 'optimize this code with genetic search', 'set up evolutionary hyperparameter tuning', 'multi-agent code mutation pipeline', 'self-improving training loop', 'darwin-style evolutionary optimization'.
Decompose complex, long-horizon coding tasks into PR-like chains of verifiable subtasks with cross-stage dependency tracking and iterative refinement. Use when: 'break this feature into PRs', 'plan a multi-step implementation', 'decompose this project into verifiable stages', 'help me build this feature incrementally with tests at each step', 'create a chain of PRs for this task', 'plan an implementation with bug-fix iterations'.
Audit and harden tool-augmented AI agent systems against Tag-Along Attacks -- adversarial agent-to-agent jailbreaks that exploit legitimate tool privileges through conversation. Use when asked to 'red-team an agent system', 'audit agentic tool safety', 'design verifiable agent security tests', 'build a tag-along attack harness', 'harden an operator agent against privilege escalation', or 'evaluate agent-to-agent attack surfaces'.
Build agentic image retrieval systems that perform multi-step contextual reasoning over visual histories instead of isolated semantic matching. Use when: 'build a context-aware image search agent', 'retrieve images using temporal reasoning', 'search photos by contextual clues across events', 'implement dual-memory agent for image retrieval', 'create a visual history exploration pipeline', 'benchmark multimodal agents on retrieval tasks'.
Evaluate dialogue summaries using the DIAL-SUMMER hierarchical error taxonomy. Detects 10 fine-grained error types across two levels: dialogue-level (speaker/turn structure) and within-turn-level (semantic content). Use when: 'evaluate this meeting summary', 'check this dialogue summary for errors', 'find hallucinations in this conversation summary', 'audit this chat transcript summary', 'grade this call center summary', 'detect speaker misattribution in this summary'.
Apply Step-wise Marginal Information Gain (MIG) credit assignment to multi-step reasoning tasks. Evaluates each reasoning step by its marginal contribution toward the correct answer rather than by position or final outcome alone. Use this skill when asked to: "evaluate my chain-of-thought", "score each reasoning step", "find where my logic goes wrong", "credit assign my solution steps", "debug my multi-step reasoning", "identify which steps actually matter in this derivation".
Design and implement multi-agent workflows using the DeepDiver hierarchical orchestration pattern with diffusion-inspired parallel planning. Applies DLLM Agent principles -- global planning signals, reduced backtracking, span-aware execution, and structured tool-call hardening -- to build agent pipelines that converge faster on correct action paths. Use when: 'build an agent pipeline with planner and workers', 'reduce backtracking in my agent loop', 'design a hierarchical agent workflow', 'optimize multi-step tool-use agent', 'implement DeepDiver-style agent orchestration', 'harden tool calls in my agent system'.
Audit and harden the moral robustness of Vision-Language Model (VLM) pipelines against adversarial perturbations that flip ethical judgments. Implements perturbation probes, flip-rate measurement, and inference-time defenses from Liu et al. (2026). Use when: 'test VLM moral robustness', 'audit VLM safety', 'harden VLM ethical judgments', 'probe model moral consistency', 'red-team VLM morality', 'evaluate VLM alignment stability'.
Build scope-matched knowledge graph RAG pipelines where retrieval precision beats breadth. Constructs domain-specific KGs from scientific literature, selects scope-aligned subgraphs for retrieval, and injects focused context into LLM prompts — avoiding the accuracy loss caused by union-based retrieval. Use when: "Build a knowledge graph RAG pipeline for medical questions", "Add domain-specific retrieval to my LLM app", "My RAG pipeline returns too much irrelevant context", "Help me scope my knowledge graph to match my query domain", "Design a biomedical QA system with knowledge graphs", "Reduce noise in my retrieval-augmented generation system"
Synthesize formal programming specifications from natural-language API docs using grammar induction. Extracts rules from documentation, induces a domain-specific specification grammar (EBNF), and generates validated formal specs. Use when: 'formalize this API documentation', 'extract specifications from these docs', 'generate formal rules from this README', 'convert these requirements to formal specs', 'induce a grammar for these programming rules', 'validate API usage against documentation'.
Structured rebuttal and critique-response generation using the DRPG framework (Decompose, Retrieve, Plan, Generate). Breaks down multi-point feedback into atomic concerns, retrieves supporting evidence from source documents, selects a rebuttal strategy (clarification vs. justification), and generates targeted point-by-point responses. Trigger phrases: "Write a rebuttal to this review", "Respond to reviewer comments", "Address this feedback point by point", "Help me respond to this code review / PR review", "Draft a response to these critique points", "Decompose this review and plan responses".
Build dialect-aware RAG conversational agents that handle non-standard orthography, code-switching, and multi-script input. Uses a dual-path architecture: deterministic NLU for structured flows + RAG fallback for open-domain queries. Trigger phrases: 'build a dialect chatbot', 'RAG agent for Arabic dialect', 'handle code-switching in chatbot', 'multi-script NLU pipeline', 'Algerian Arabic conversational agent', 'dialect-aware customer service bot'
Deploy and optimize small language models (SLMs) as autonomous agents using the effGen framework. Implements prompt compression (70-80% context reduction), five-factor complexity routing, intelligent task decomposition, and unified memory for local SLM-based agent systems. Triggers: 'set up effgen agent', 'deploy small language model agent', 'optimize prompts for small model', 'compress agent context for SLM', 'build local AI agent with effgen', 'route tasks by complexity for small models'
Build and configure ECHO-style research platforms for running reproducible user studies comparing chat-based AI and web search interactions. Use when: 'set up a user study platform', 'build a chat vs search experiment', 'log participant interactions with LLMs', 'create a research workflow with surveys and tasks', 'export user study interaction traces', 'configure an IRB-compliant experiment with pre/post questionnaires'.
Detect phishing URLs using Least-to-Most iterative decomposition with answer sensitivity scoring. Triggers: 'analyze this URL for phishing', 'is this URL safe', 'check URL for phishing indicators', 'phishing detection', 'classify this URL as phishing or benign', 'scan URL for suspicious patterns'
Build verifiable enterprise GUI agent benchmarks using schema-grounded task generation and SQL-based deterministic verification. Use when: 'generate enterprise test tasks from a database schema', 'build SQL verification for GUI agent tasks', 'create benchmark for CRM/ERP/ITIL agents', 'reverse-engineer business logic from DB schema', 'validate agent actions with database state checks', 'set up dockerized enterprise benchmark environments'.
Solve mathematical problems using IIPC (Iteratively Improved Program Construction) -- a dual-branch approach that combines iterative code execution with independent chain-of-thought reasoning, then merges both for a verified answer. Trigger phrases: "solve this math problem with code verification", "math reasoning with execution feedback", "verify this calculation programmatically", "solve and check with code", "iterative math solving", "math problem with dual reasoning"
Build and evaluate long-term memory systems for conversational agents using the ES-MemEval five-capability framework (information extraction, temporal reasoning, conflict detection, abstention, user modeling). Use when: 'evaluate my chatbot memory', 'build long-term user memory for my agent', 'benchmark conversational memory capabilities', 'add personalization memory to my dialogue system', 'detect contradictions in user history', 'implement temporal reasoning over chat sessions'.
Design and optimize Kubernetes-native GenAI inference platforms using Kueue job queuing, Dynamic Accelerator Slicer (DAS) GPU partitioning, and Gateway API Inference Extension (GAIE) with llm-d for multi-stage AI pipelines. Use when: 'set up Kubernetes for AI inference', 'configure Kueue for batch GPU jobs', 'partition GPUs with MIG slicing on Kubernetes', 'optimize LLM inference routing on Kubernetes', 'build a Whisper-to-LLM pipeline on K8s', 'reduce TTFT latency for LLM serving'.
Build and evaluate cross-dataset entity matching pipelines for recommender systems. Implements the Reddit-Amazon-EM methodology: rule-based, lexical, embedding-based, graph neural, and LLM-based entity matching with systematic evaluation. Use when: 'match products across catalogs', 'link entities between datasets', 'deduplicate items across platforms', 'entity resolution for recommendations', 'cross-dataset product mapping', 'evaluate entity matching methods'.
Build statistically efficient LLM evaluation pipelines that combine direct accuracy with pairwise comparison signals as control variates. Use when the user asks to 'evaluate LLM accuracy on a benchmark', 'rank models with small sample sizes', 'reduce variance in LLM evaluation', 'build a model comparison pipeline', 'get tighter confidence intervals for model performance', or 'statistically compare reasoning models'.
Build event-driven video stream processing pipelines that detect meaningful state transitions instead of processing every frame. Use when asked to: 'build a real-time video understanding system', 'detect events in a video stream', 'process long video with memory', 'reduce redundant frame processing', 'stream video to LLM efficiently', 'build an event-aware video pipeline'.
Apply Agent Reasoning Reward Model (Agent-RRM) structured critique to improve multi-step agent trajectories. Evaluates tool-use chains with explicit reasoning traces, focused critiques, and process scores. Use this skill when: "Critique this agent's reasoning trace", "Evaluate my tool-calling workflow and find flaws", "Score this multi-step agent trajectory", "Help me build a reward model for agent training", "Improve this agent's reasoning with structured feedback", "Debug why my agent pipeline produces wrong answers".
Implement Multi-head Explicit Attention (MEA) with inter-head interaction for Transformer models. Adds Head-level Linear Composition (HLC) modules and head-level Group Normalization to standard multi-head attention, enabling cross-head communication, faster convergence with larger learning rates, and 50% KV-cache compression via virtual heads. Trigger phrases: "Add inter-head interaction to my attention layer", "Implement MEA attention with head-level linear composition", "Compress KV-cache using virtual heads", "Replace multi-head attention with explicit cross-head mixing", "Add head-level normalization to my Transformer", "Reduce KV-cache memory with low-rank head reconstruction".
Build intelligent trigger-action automation systems using FARM's two-stage architecture: contrastive retrieval + multi-agent LLM selection with field-level configuration. Use when asked to 'create an IFTTT-style automation', 'build a trigger-action pipeline', 'connect event triggers to actions with field bindings', 'design a multi-agent workflow resolver', 'generate executable automation rules from natural language', or 'wire up webhook triggers to API actions'.
Diagnose why generated code fails and apply the right fix strategy (self-critique, RAG, multi-model, or progressive prompting) based on a data-driven decision framework from empirical research on 25 GitHub projects. Trigger phrases: "my generated code doesn't work", "fix this code generation failure", "why does this code keep failing", "help me debug LLM-generated code", "improve code generation quality", "the AI-generated code is wrong"
Adaptive Markdown output formatting that balances semantic fidelity with structural correctness. Applies the FMBench methodology to detect and fix broken lists, malformed tables, inconsistent headings, and invalid code blocks in LLM-generated Markdown. Use this skill when asked to: "Format this output as clean Markdown", "Fix the formatting in this document", "Generate a well-structured report with tables and code blocks", "Convert this content into properly nested Markdown", "Clean up this Markdown: the lists and headings are broken", "Reformat this response so it renders correctly".
Generate optimized GPU kernels in Triton from PyTorch reference code using the Makora RL-based iterative refinement workflow. Applies tool-augmented kernel generation with correctness validation and performance benchmarking. Use when: 'write a Triton kernel for this PyTorch op', 'optimize this GPU kernel', 'generate a fast Triton implementation of matrix multiply', 'convert this PyTorch module to a custom Triton kernel', 'speed up this CUDA operation with Triton', 'benchmark my kernel against TorchInductor'.
Build uncertainty-aware planners for multi-agent systems using the PCE (Planner-Composer-Evaluator) decision tree framework. Converts implicit LLM reasoning assumptions into scored decision trees that select actions under uncertainty without heavy inter-agent communication. Use when: 'build an agent that plans under uncertainty', 'create a decision tree from assumptions', 'multi-agent planning with partial observability', 'reduce agent communication overhead', 'score actions by likelihood and cost', 'uncertainty-aware action selection'.
Security audit and hardening for personalized LLM-based agents against prompt injection, tool poisoning, and memory attacks. Use when: 'audit my agent for security vulnerabilities', 'test my AI assistant against prompt injection', 'harden my agent toolchain', 'evaluate memory poisoning risks in my agent', 'red-team my personalized AI agent', 'add defenses against indirect prompt injection'.
Reverse-engineer game mechanics from gameplay traces using a two-stage causal induction pipeline: first infer a Structural Causal Model (SCM) from observations, then translate it into executable game rules (VGDL or equivalent). Trigger phrases: 'infer game mechanics from traces', 'reverse-engineer game rules', 'build causal model from gameplay', 'extract game logic from observations', 'generate VGDL from gameplay', 'causal induction for games'
Diagnose and explain failures in agentic AI systems using trace-based rubric evaluation, bridging static feature attribution (SHAP/LIME) with trajectory-level diagnostics. Use when: 'debug why my agent failed', 'explain agent behavior', 'evaluate agent traces', 'add explainability to my agent pipeline', 'diagnose agentic failures', 'trace-based agent analysis'.
File-system-based dual-agent deep research framework that scales beyond context windows. Separates evidence gathering (Context Builder) from report writing (Report Writer) using a persistent hierarchical knowledge base on disk. Use this skill when the user says: "research this topic in depth", "write a comprehensive report on X", "do deep research about Y", "investigate and write up Z thoroughly", "build a knowledge base and then write a report", "scale up research quality with more rounds".
Audit LLM-generated product recommendations for gender and race bias using marked words analysis, SVM classification, and Jensen-Shannon Divergence. Use when: 'check recommendations for bias', 'audit LLM outputs for demographic fairness', 'detect stereotypes in product suggestions', 'analyze bias in generated text across demographics', 'measure recommendation disparity by race or gender', 'build a bias detection pipeline for LLM outputs'.
Audit LLM-based automatic short answer grading (ASAG) systems for adversarial vulnerabilities using token-level and prompt-level attack strategies from the GradingAttack framework. Triggers: 'test grading robustness', 'adversarial attack on grading', 'audit LLM grader', 'red-team answer grading', 'ASAG vulnerability assessment', 'grading fairness attack'
Design and implement OS-level resource controls for sandboxed AI agents using hierarchical cgroups, eBPF enforcement, and tool-call-level resource management. Use when: 'set up cgroups for AI agent containers', 'control memory for coding agents', 'isolate tool-call resources with eBPF', 'manage multi-tenant agent resource limits', 'prevent OOM kills in agent sandboxes', 'configure agent resource policies with cgroup v2'.
Generate deep research reports by interleaving evidence-based drafting with reasoning-driven deepening. Uses the WARP (Writing As Reasoning Policy) framework from AgentCPM-Report to dynamically evolve outlines during writing instead of rigidly following a static plan. Trigger phrases: "deep research report", "write a comprehensive analysis", "investigate and write up", "research report on", "deep dive report", "analyze this topic thoroughly and produce a report"
Implement diagnostic safety guardrails for AI agent systems using the AgentDoG three-dimensional taxonomy (risk source, failure mode, real-world harm). Monitors agent trajectories, diagnoses root causes of unsafe actions, and provides fine-grained risk labels beyond binary safe/unsafe classification. Trigger phrases: "add safety guardrails to my agent", "diagnose agent risks", "monitor agent trajectory safety", "implement agentic guardrail", "classify agent risk behavior", "audit agent tool use safety"
Design, evaluate, and improve LLM-based agentic systems for healthcare using a seven-dimensional taxonomy with 29 sub-dimensions. Triggers: 'build a healthcare AI agent', 'evaluate my medical agent', 'healthcare agent architecture review', 'audit agent capabilities for clinical use', 'design a multi-agent medical system', 'gap analysis for healthcare LLM agent'.
Build AI-augmented annotation pipelines for creating high-quality information retrieval and QA datasets. Combines LLM-generated suggestions (questions, passage relevance scores, answer spans) with human review workflows to accelerate dataset creation. Use when: 'build an annotation pipeline for IR data', 'create a QA dataset from documents', 'annotate passages for retrieval training', 'generate question-answer pairs from a corpus', 'build a human-in-the-loop labeling tool for search', 'set up AI-assisted relevance annotation'.
Analyze AI coding agent activity on GitHub repositories using the AIDev methodology. Identify agentic PRs, measure agent adoption metrics, evaluate PR quality, assess review dynamics, and benchmark human-AI collaboration patterns. Use when: 'analyze AI agent PRs in this repo', 'measure AI coding adoption', 'evaluate agentic PR quality', 'compare AI agent contributions', 'audit AI-generated code patterns', 'assess review burden of AI PRs'.
Generate diverse, high-quality answer variants for open-ended tasks using DARL's bounded-diversity framework. Use when: 'generate multiple solutions', 'diverse answers to this prompt', 'brainstorm varied approaches', 'explore alternative implementations', 'rewrite this N different ways', 'what are different ways to solve this'.
Adapt general-purpose BPE tokenizers into domain- or language-specialized tokenizers using the AdaptBPE post-training strategy. Replaces low-utility tokens with high-frequency domain-specific tokens to improve tokenization efficiency without retraining from scratch. Trigger phrases: "adapt tokenizer to domain", "specialize BPE for medical text", "optimize tokenizer for French", "reduce token fertility for code", "adapt vocabulary for legal documents", "domain-specific tokenizer"
Evaluate and harden LLM-based autonomous agents against adversarial attacks using the α³-SecBench layered security framework. Assesses security (attack detection, CWE attribution), resilience (safe degradation), and trust (policy-compliant tool usage) across 7 autonomy layers. Use when: 'audit my LLM agent for security', 'add adversarial resilience to my autonomous system', 'evaluate agent trust and tool safety', 'harden my AI agent against prompt injection', 'security benchmark my LLM pipeline', 'test my agent for hallucinated tool calls'.
Perform repository-level automated code review on pull requests using hierarchical context retrieval and structured defect classification. Triggers: 'review this PR', 'find defects in this diff', 'automated code review', 'review these code changes', 'check this pull request for issues', 'analyze this patch for bugs'
Adaptive curriculum-driven iterative optimization for autonomous ML engineering tasks. Uses Evolving Data Buffers and Learnability Potential sampling from the AceGRPO paper to structure multi-step agent workflows that avoid behavioral stagnation. Triggers: 'optimize ML pipeline iteratively', 'adaptive curriculum for code tasks', 'iterative agent optimization', 'prioritize learning tasks', 'evolving task buffer', 'curriculum-based code improvement'.
Build institutional LLM platforms that integrate with existing data systems (EHR, CRM, ERP) using the ChatEHR pattern: model-agnostic automations, interactive UI, hallucination monitoring, and value assessment frameworks. Use when asked to "build an LLM platform for our organization", "create automations that combine prompts with live data", "design a model-agnostic AI system", "set up LLM monitoring and evaluation", "build a ChatEHR-style integration", or "create a value assessment framework for AI deployments".
Apply the AERO dual-loop self-evolution framework to iteratively improve reasoning on complex tasks. Uses entropy-based difficulty calibration, counterfactual verification, and staggered role refinement to solve hard problems without external oracles. Triggers: 'reason through this step by step with self-correction', 'solve this hard problem autonomously', 'verify your reasoning with counterfactuals', 'self-critique and improve your answer', 'use dual-loop reasoning on this', 'iteratively refine your solution'
Threat model multi-agent LLM systems using the AgentHeLLM framework -- formally separating asset identification from attack path analysis with graph-based poison/trigger path discovery. Use when: 'threat model my agent system', 'find attack paths in my A2A architecture', 'analyze security of my LLM agents', 'map attack surfaces for my multi-agent app', 'identify poison paths in my agent graph', 'what are the safety risks in my agent pipeline'.
Automatically evaluate software research artifacts (code repositories with READMEs) by constructing dependency-aware command graphs, building containerized environments, and executing instructions with structured error recovery. Use when asked to: 'evaluate this artifact', 'reproduce this paper's results', 'run this repo's README instructions', 'check if this artifact builds and runs', 'automate artifact evaluation', 'verify research reproducibility'.
Structured multi-agent courtroom debate for explainable high-stakes tabular decisions. Use when: 'set up a multi-agent debate for this prediction', 'use adversarial agents to classify this table row', 'build a courtroom-style reasoning pipeline', 'create prosecutor/defense/judge agents for this decision', 'explain this tabular prediction with a debate', 'run a structured adversarial analysis on this dataset'.
Build multi-agent LLM negotiation systems where buyer and seller agents reach deals through natural language. Use when asked to 'build a negotiation system', 'create buyer-seller agents', 'implement price bargaining with LLMs', 'simulate a marketplace with AI agents', 'design an agentic commerce pipeline', or 'build a multi-round trading framework'.
Agentic Text-to-SQL with semantic memory that captures and reuses structured execution traces. Use when: 'write SQL for this database', 'query this schema', 'generate SQL from natural language', 'text to SQL with complex schema', 'help me explore this database and answer questions', 'build a reusable SQL generation pipeline'.
Orchestrate multi-agent teams for autonomous software engineering using the Agyn methodology: coordinator, researcher, implementer, and reviewer agents with structured communication, isolated sandboxes, and iterative review loops. Use when: 'set up a multi-agent team to fix this bug', 'use agent swarm to implement this feature', 'resolve this GitHub issue with a team of agents', 'coordinate agents to refactor this module', 'spin up an engineering team to tackle this task', 'use Agyn-style agents to solve this problem'.
Reverse-engineer legacy numerical/scientific Fortran or C code and translate it into modern Python frameworks (Devito, NumPy, SciPy, FEniCS, etc.) using a multi-stage analysis pipeline with knowledge-graph-guided retrieval, structured code synthesis, and iterative validation. Trigger phrases: "convert this Fortran code to Python", "reverse engineer this finite difference code", "translate this legacy numerical solver to Devito", "modernize this scientific computing code", "what does this Fortran stencil do and how do I write it in Python", "migrate this CFD solver from Fortran to a modern framework"
Build value-aligned conversational agents using the VAPT (Value-Alignment Perception Toolkit) framework from CHI '26. Extracts user values from chat histories, embodies those values in persona-consistent responses, and explains inferences with evidence trails. Use when asked to: 'build a chatbot that understands user values', 'extract personality or values from conversations', 'create a value-aligned agent', 'add empathy safeguards to a chatbot', 'implement user value profiling', 'design a personalized conversational AI with transparency'.
Build intelligent alert lifecycle management systems for cloud infrastructure using graph-based denoising, RAG-powered summarization, and multi-agent rule refinement. Trigger phrases: "reduce alert fatigue in our monitoring system", "deduplicate and correlate alerts", "summarize alerts for on-call engineers", "refine our alerting rules automatically", "build an alert denoising pipeline", "too many alerts, help me triage".
Implement AlienLM-style API-boundary privacy layers that protect sensitive text sent to black-box LLM APIs using vocabulary-scale bijective token remapping. Use when: 'add privacy layer to LLM API calls', 'protect prompts sent to external API', 'alienize text for API privacy', 'build token-level encryption for LLM pipeline', 'implement bijective vocabulary mapping', 'privacy-preserving LLM deployment'.
Repository-level code completion using AlignCoder's query enhancement and aligned retrieval technique. Generates candidate completions to build an enhanced query that bridges the semantic gap between unfinished code and the target completion, then retrieves precisely relevant cross-file context. Use this skill when the user says: "Complete this code using context from other files in the repo", "Help me finish this function using the rest of the codebase", "Use cross-file context to complete this code", "I need repo-aware code completion for this partial implementation", "Retrieve relevant code from the repository to help complete this", "What would go here based on how similar code works in this project?"
Build sequential LLM pipelines for scientometric question answering over academic databases. Decomposes meta-scientific queries into entity recognition, multi-step planning, parallel data retrieval, and analytical synthesis. Use when: 'build a scientometric QA system', 'answer questions about research impact', 'query academic publication databases with natural language', 'analyze citation metrics for institutions or authors', 'create a pipeline for science-of-science questions', 'implement RAG over scholarly metadata'.
Dynamically create specialized sub-agents for complex multi-step tasks using the AOrchestra pattern: decompose goals, then spawn tailored (Instruction, Context, Tools, Model) executors on-the-fly. Use when: 'break this task into sub-agents', 'orchestrate agents for this problem', 'create a multi-agent workflow', 'delegate subtasks to specialized agents', 'build an agent pipeline for this', 'dynamically assign agents to subtasks'.
Build agentic RAG systems where the LLM autonomously decides retrieval strategy using hierarchical interfaces (keyword search, semantic search, chunk read) instead of fixed retrieval pipelines. Replaces single-shot retrieval and predefined workflows with a ReAct-style loop that scales with model capability. Trigger phrases: "build an agentic RAG system", "implement hierarchical retrieval", "make the LLM control its own retrieval", "scale RAG with reasoning", "multi-hop question answering over documents", "replace my fixed RAG pipeline with an agent".
Build LLM-based content moderation pipelines using zero-shot classification with open-weight models. Implements the structured prompting and evaluation methodology from Chou et al. (2026) for detecting rudeness, intolerance, and threats in social media posts. Use when: 'build a content moderation system', 'detect harmful posts with LLMs', 'moderate social media content', 'set up open-weight moderation pipeline', 'evaluate LLM moderation accuracy', 'privacy-preserving content filtering'.
Analyze open-source projects for sustainability risks and apply AI-driven interventions for bug triaging, community health assessment, vulnerability detection, contributor onboarding, and maintenance automation. Trigger phrases: "analyze OSS health", "assess project sustainability", "triage issues automatically", "detect community smells", "onboard new contributors", "audit OSS security posture"
Evaluate LLM-generated mental health responses using a 6-attribute clinical rubric spanning Cognitive Support (Guidance, Informativeness, Safety) and Affective Resonance (Empathy, Helpfulness, Interpretation). Based on Badawi et al. (2026). Trigger phrases: "evaluate mental health response", "assess therapeutic quality", "score counseling output", "rate empathy of LLM response", "audit mental health chatbot", "cognitive-affective gap analysis"
Build generative NER systems using LLMs with optimal output formats and prompt engineering. Use when: 'extract entities from text', 'build a NER pipeline with an LLM', 'named entity recognition with generative models', 'format NER output as XML or bracketed', 'fine-tune a model for entity extraction', 'nested entity recognition'.
Trace and attribute RAG system responses back to specific tools and sources using Atomic Information Flow (AIF) -- a network flow model that decomposes outputs into atoms and computes precise attribution scores. Use when: "trace which tools contributed to this RAG response", "attribute this answer to its sources", "debug why my RAG pipeline returned wrong information", "compress RAG context without losing accuracy", "build attribution into my multi-agent system", "score tool contribution in my retrieval pipeline".
Customize LLMs for enterprise code repositories using semantic scopes -- automatically partition codebases into meaningful units (function bodies, conditionals, loops, logging patterns) and generate fine-tuning data pairs for repository-specific code completion. Use when: 'customize a model for our codebase', 'generate fine-tuning data from our repo', 'improve code completion for private code', 'extract semantic scopes from source files', 'build training pairs for code LLM', 'set up RAG for enterprise code completion'.
Write structural tests for LLM-based agents using trace-based assertions, mocked LLM responses, and the test automation pyramid. Use when the user says 'test my agent', 'write agent tests', 'mock LLM responses', 'add regression tests for my agent', 'structural testing for agents', or 'trace-based assertions'.
Diagnose and repair failing computational research code to restore reproducibility. Uses an agent-based iterative workflow: inspect files, identify failures (missing packages, broken paths, version conflicts, missing logic), apply targeted fixes, and rerun in isolated environments. Trigger phrases: 'reproduce this analysis', 'fix this R script', 'make this code reproducible', 'debug this research pipeline', 'repair computational workflow', 'rerun this study'
Build multi-agent AI systems for high-throughput scientific workflows with metacognitive self-assessment. Implements the Polymer Research Lifecycle (PRL) architecture: a Planner Agent decomposes complex scientific tasks into subtasks assigned to specialized domain agents (Research, Characterization, ML Model, Safety, Synthesis, Execution, Reporting), which produce consensus predictions with uncertainty estimates and continuously self-optimize via three-layer metacognitive reflection. Trigger phrases: "Build a multi-agent pipeline for materials property prediction", "Create a high-throughput screening system with agent consensus", "Implement metacognitive self-assessment for an agent swarm", "Design an autonomous scientific workflow with specialized agents", "Set up a polymer informatics pipeline with uncertainty quantification", "Orchestrate domain-specific agents for computational chemistry".
Detect and extract pairwise causal relationships from text using structured prompting strategies (zero-shot, CoT, few-shot ICL, Least-to-Most, ReAct). Built on the unified evaluation framework from Anuyah et al. (2026) covering 12 datasets across biomedical and multi-domain contexts. Use when the user asks to "find causal relationships in text", "extract cause and effect from sentences", "detect causation vs correlation", "build a causal extraction pipeline", "benchmark causal reasoning", or "annotate causal links in a corpus".
Generate synthetic labeled datasets with LLMs to train smaller, cheaper classifiers -- especially for low-resource languages and niche tasks. Use when: 'generate training data for my classifier', 'I need labeled data in [language]', 'distill this LLM into a smaller model', 'create synthetic examples for fine-tuning', 'bootstrap a text classifier without manual annotation', 'train a multilingual classifier with no labeled data'.
Analyze and reduce cognitive load in tool-use agent workflows using the Cognitive Load Framework from AAAI 2026. Diagnoses why agent pipelines fail by decomposing task complexity into Intrinsic Load (tool dependency depth/branching) and Extraneous Load (ambiguity/parameter confusion). Use when: 'diagnose why my agent keeps failing', 'reduce tool-call complexity', 'optimize my agent workflow', 'analyze cognitive load of this pipeline', 'map capability boundaries', 'simplify my tool orchestration'.
Identify bug-inducing commits using temporal knowledge graph search beyond git blame. Use when: 'find what commit introduced this bug', 'trace root cause of regression', 'which commit broke this', 'find the bug-inducing change', 'blame analysis for this fix', 'what caused this defect'
Inter-procedural vulnerability detection using context-aware reasoning. Analyzes functions alongside their callers, callees, and global state to find vulnerabilities that single-function analysis misses. Uses code property graph traversal, security-focused context profiling, relevance scoring, and structured reasoning traces. Trigger phrases: "Check this code for vulnerabilities across function boundaries", "Analyze this function with its callers and callees for security issues", "Find inter-procedural vulnerabilities in this codebase", "Review this code for vulnerabilities that depend on how it's called", "Do a deep security audit with cross-function context", "Analyze whether this function is safe given how callers use it".
Build trait-based essay scoring systems that evaluate argumentative writing across multiple rubric dimensions (Content, Organization, Word Choice, Sentence Fluency, Conventions) using structured in-context learning prompts and ordinal regression. Trigger phrases: 'score essays by trait', 'build essay grading rubric', 'argumentative essay evaluation', 'trait-based writing assessment', 'rubric-aligned essay scoring', 'automated writing feedback'.
Cross-cultural meme transcreation using a three-stage hybrid pipeline (cultural analysis, visual generation, assembly) that preserves humor and communicative intent while adapting culture-specific references between languages. Triggers: 'transcreate this meme', 'adapt meme for Chinese audience', 'convert meme to US culture', 'cross-cultural meme adaptation', 'localize this meme for another culture', 'meme cultural translation'
Automatically discover and test for hidden biases in LLM-as-a-Judge evaluation pipelines using the BiasScope framework. Generates bias hypotheses, perturbs test cases, and validates whether judge models are susceptible. Use when: 'audit my LLM judge for bias', 'find biases in my evaluation pipeline', 'stress test my LLM evaluator', 'check if my model judge is robust', 'discover unknown biases in my scoring system', 'build adversarial eval sets for my judge'.
Adapt Text-to-SQL systems and benchmarks for non-English, morphologically rich languages using controlled translation pipelines and agentic multi-stage reasoning. Triggers: 'translate SQL benchmark to Turkish', 'cross-lingual text-to-SQL', 'adapt BIRD dataset to another language', 'build multilingual SQL generation', 'text-to-SQL for low-resource language', 'non-English database querying'
Architect secure, intent-centric agent systems using the Aura pattern: Hub-and-Spoke agent topology, cryptographic identity binding, semantic firewalls, taint-aware memory, and sandboxed execution. Use when: 'design a secure agent orchestration system', 'add security to my multi-agent pipeline', 'prevent prompt injection in agent workflows', 'build a sandboxed agent runtime', 'implement agent-to-agent access control', 'add taint tracking to LLM memory'.
Build training-free pipelines that convert sparse 3D LiDAR point clouds into depth-encoded 2D images for classification by Vision-Language Models (CLIP, etc.). Covers the full workflow: point cloud denoising, temporal frame fusion, canonical orientation, orthographic projection, morphological cleanup, bilateral smoothing, and few-shot VLM prompting with semantic anchoring. Use when: 'classify vehicles from roadside LiDAR', 'convert point clouds to images for VLM', 'few-shot 3D object classification without training', 'bridge LiDAR to vision-language model', 'bootstrap labeled dataset from unlabeled LiDAR', 'cold start vehicle classifier from point clouds'.
Apply Cobalt-style contextual bandit learning to multi-turn code generation tasks. Decomposes iterative coding into partial trajectory completions, treating each debugging turn as a single-step bandit problem rather than a full RL rollout. Use when: 'help me fix this code iteratively', 'debug this with test feedback', 'multi-turn code repair', 'iterative code generation with execution feedback', 'fix failing test cases step by step', 'recover from wrong code using error output'.
Build offline TTS-narrated e-books with exact audio-text synchronization in EPUB 3 Media Overlay format. Use when the user asks to 'create a narrated ebook', 'add TTS audio to an epub', 'build an audiobook with text highlighting', 'synchronize speech with ebook text', 'convert epub to read-aloud format', or 'generate media overlays for epub'.
Analyze and optimize multi-agent code generation pipelines using causality-based importance ranking of intermediate features. Identifies which pipeline stages matter most, enables targeted failure repair, token-efficient pruning, and hybrid LLM backend assignment. Triggers: 'analyze my multi-agent pipeline', 'optimize agent code generation', 'which pipeline stages matter most', 'reduce token usage in my agent system', 'fix failing multi-agent code generation', 'assign LLMs to pipeline stages'
Translate ODD protocol specifications into validated, executable agent-based model (ABM) code in Python. Use when the user says 'implement this ABM', 'convert ODD to code', 'build an agent-based model from this specification', 'replicate this NetLogo model in Python', 'translate this model description into a simulation', or 'create a predator-prey / ecological / social ABM'.
Translate natural language optimization problems into executable solver code using a Canonical Intermediate Representation (CIR) schema and multi-agent R2C pipeline. Decomposes operational rules into constraint archetypes and modeling paradigms before generating code. Triggers: "formulate this optimization problem", "write a solver for this scheduling problem", "convert these business rules to constraints", "model this linear program from the description", "generate Gurobi/PuLP code for this OR problem", "help me formulate these operational constraints mathematically".
Validate and score LLM-generated statistical reasoning using a three-axis rubric (Correctness 40%, Explanation 35%, Reasoning 25%) and LLM-as-judge evaluation, based on Nagarkar et al. 2026. Use when: 'evaluate this statistical analysis', 'score this model output', 'check my stats reasoning', 'grade this explanation', 'build a stats evaluation pipeline', 'assess reasoning quality'.
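The three-axis weighting above can be sketched as a weighted sum. This is a minimal illustration, not the cited paper's scoring code: the weights come from the description, while the 0-to-1 per-axis scale and the names `RUBRIC_WEIGHTS` and `rubric_score` are assumptions made for the example.

```python
# Weights as stated in the description; the [0, 1] score scale is an assumption.
RUBRIC_WEIGHTS = {"correctness": 0.40, "explanation": 0.35, "reasoning": 0.25}


def rubric_score(axis_scores):
    """Combine per-axis scores (each in [0, 1]) into a single weighted score."""
    missing = set(RUBRIC_WEIGHTS) - set(axis_scores)
    if missing:
        raise ValueError(f"missing axes: {sorted(missing)}")
    for axis in RUBRIC_WEIGHTS:
        if not 0.0 <= axis_scores[axis] <= 1.0:
            raise ValueError(f"{axis} score must be in [0, 1]")
    # Weighted sum over the three axes; the weights already sum to 1.
    return sum(RUBRIC_WEIGHTS[axis] * axis_scores[axis] for axis in RUBRIC_WEIGHTS)
```

A judge model would supply the three per-axis scores; keeping the aggregation outside the judge prompt makes the weighting auditable.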
Analyze test suites for flaky tests using LLM-based classification with context-augmented reasoning. Applies findings from Berndt et al. (2026) showing that test code alone is insufficient — the skill teaches Claude to gather surrounding project context (configs, dependencies, environment, production code) before classifying. Trigger phrases: 'find flaky tests', 'classify flaky tests', 'detect test flakiness', 'why is this test flaky', 'analyze test reliability', 'flaky test triage'
Generate semantics-preserving variants of Python CTF challenges to stress-test agentic LLM robustness. Applies the Evolve-CTF methodology: identifier renaming, dead code insertion, composite transforms, and obfuscation to create challenge families that share a single exploit but vary in surface-level code. Trigger phrases: "generate CTF variants", "obfuscate this challenge", "create a challenge family", "test agent robustness on CTF", "semantics-preserving transformation", "evolve this CTF challenge".
Dual-mode reasoning framework that dynamically routes problems to specialized strategies: computational flow for math, symbolic JSON state tracking for spatial/entity reasoning, and hybrid fact-extraction for multi-hop inference. Use when asked to 'solve this step by step', 'reason through this problem', 'track state changes', 'figure out the answer to this logic puzzle', 'solve this math word problem', or 'chain these facts together'.
Generate high-fidelity plotting code from chart images or descriptions using structured intermediate specifications. Decomposes charts into semantic topology (type, coordinates, domains, series) and runtime numerical facts before producing code, preventing hallucinated data and structural errors. Trigger phrases: "convert this chart image to code", "recreate this plot in matplotlib", "generate plotting code from this chart", "reproduce this visualization programmatically", "write code that matches this chart exactly", "chart to code".
Build and optimize two-stage RAG pipelines with bi-encoder retrieval, cross-encoder re-ranking, and empirically-validated chunking strategies. Use when: 'build a RAG pipeline', 'add re-ranking to retrieval', 'optimize chunking for documents', 'set up document QA with re-ranking', 'improve RAG faithfulness', 'two-stage retrieval pipeline'.
Compress user-specific memories for LLM personalization by clustering semantically similar memories and merging within clusters, reducing token count while preserving generation quality. Based on Bohdal et al. (ICASSP 2026). Use this skill when the user mentions: "compress memories for context window", "reduce memory tokens for on-device LLM", "cluster and merge user memories", "personalization with limited context budget", "memory-efficient prompt construction", "on-device LLM memory management".
Predict and simulate GUI state transitions by generating renderable HTML/CSS/SVG code from screenshots and user actions. Use when asked to: 'simulate what happens when I click this button', 'predict the next UI state', 'generate HTML that reproduces this screenshot', 'build a GUI world model', 'create a visual sandbox for UI testing', 'convert a mobile screenshot to interactive HTML'.
Generate high-quality multiple-choice questions at controlled cognitive levels using the ReQUESTA multi-agent framework. Decomposes MCQ authoring into planning, generation, evaluation, and post-processing stages with specialized agents targeting text-based (recall), inferential (synthesis), and main idea (abstraction) comprehension. Trigger phrases: "generate MCQs from this text", "create quiz questions at different difficulty levels", "make multiple choice questions for this reading", "build an assessment from this passage", "create comprehension questions", "generate exam items from this content"
Build autonomous cloud operations using a four-plane cognitive architecture (Sensing, Reasoning, Orchestration, Experience) with Kubernetes, Terraform, OPA, and ML anomaly detection. Use when: 'set up self-healing Kubernetes infrastructure', 'add anomaly detection to my cloud platform', 'create OPA policies for autonomous remediation', 'build a cognitive operations pipeline', 'implement intent-based infrastructure management', 'add intelligent auto-scaling with feedback loops'.
Build multi-hop RAG systems that answer complex questions with only 2 LLM calls total, regardless of reasoning depth. Applies CompactRAG's offline atomic QA decomposition and online entity-consistent retrieval to cut token costs by 2-5x compared with iterative RAG. Trigger phrases: "build a multi-hop RAG pipeline", "reduce LLM calls in my RAG system", "answer complex questions over a knowledge base efficiently", "implement CompactRAG", "optimize token usage in retrieval-augmented generation", "build a cost-efficient question answering system".
Analyze AI coding agent PR datasets using task-stratified acceptance rate methodology. Classify PRs into 9 task categories (feat, fix, docs, chore, refactor, test, ci, build, style), compute per-agent acceptance rates, run stratified chi-square tests with Bonferroni correction, and perform temporal trend analysis. Trigger phrases: "compare AI coding agents", "analyze PR acceptance rates", "stratified PR analysis", "which AI agent is best for", "task-type acceptance rates", "AI agent benchmark pipeline"
Multi-agent debate framework for relevance assessment and annotation completion. Uses opposing-stance LLM agents with iterative critique to label query-document relevance, detect missing annotations, and escalate uncertain cases to humans. Triggers: 'assess relevance of documents to queries', 'find missing annotations in IR benchmark', 'debate-based relevance labeling', 'complete missing labels in dataset', 'multi-agent relevance assessment', 'evaluate document relevance with debate'
Evaluate and optimize LLM-driven software engineering workflows across five task types (bug fixing, feature development, code refactoring, technical copywriting, research synthesis) using efficiency-aware metrics that go beyond correctness. Detects and eliminates loop inefficiency and inference inefficiency patterns in agentic tool usage. Use when: "evaluate my coding agent workflow", "optimize tool call efficiency", "benchmark LLM SE performance", "reduce agentic coding cost", "audit agent tool usage", "compare efficiency of coding approaches".
Compute the Conceptual Cultural Index (CCI) to measure cultural specificity of sentences using LLM-based generality estimates across culture sets. Use this skill when users ask to "measure cultural specificity", "score how culture-specific a sentence is", "compare cultural relevance across countries", "detect culturally loaded content", "evaluate cultural bias in text", or "quantify how Japanese/American/etc. a sentence is".
Enforce security by construction in AI-generated code using Constitutional Spec-Driven Development (CSDD). Creates a versioned security constitution document mapping CWE/OWASP vulnerabilities to enforceable constraints, then generates code that satisfies those constraints with full traceability. Triggers: 'generate secure code', 'create a security constitution', 'build with security by construction', 'CSDD for my project', 'enforce CWE constraints', 'secure code generation with traceability'
Enhance code generation with Programming Knowledge Graph (PKG) retrieval, tree pruning, and re-ranking. Uses fine-grained knowledge graph nodes (function blocks, code snippets, documentation paths) to retrieve precisely relevant context, then prunes and re-ranks results to minimize hallucination. Trigger phrases: "Generate code using knowledge graph context", "Help me solve this coding problem with relevant examples", "Retrieve similar code patterns for this function", "Augment my code generation with external knowledge", "Find relevant code context and generate a solution", "Use PKG-style retrieval for this programming task".
Multi-agent iterative code optimization using context compression. Decomposes optimization into three agents (Summarizer, Navigator, Sampler) that mirror RL state/policy/replay to evolve better code across iterations. Trigger phrases: 'optimize this algorithm iteratively', 'evolve better code', 'multi-agent code optimization', 'compress optimization context', 'iterative code improvement with agents', 'ContextEvolve approach'
Optimize product/content descriptions to influence rankings in LLM-based search engines (generative engines) using the CORE method. Appends strategically designed reasoning-based or review-based content to improve visibility. Triggers: 'optimize my product for AI search', 'improve ranking in ChatGPT recommendations', 'generative engine optimization', 'CORE ranking optimization', 'LLM search visibility', 'AI search product ranking'
Implements the CoNL (Conversation for Non-verifiable Learning) multi-agent self-play framework for iteratively improving outputs on tasks without clear right answers: creative writing, code review, API design, UX copy, ethical reasoning, and architectural decisions. Uses structured propose-critique-revise conversations with diagnostic reward scoring to surface the highest-quality solutions. Trigger phrases: "Improve this through self-critique" or "iteratively refine this", "Use multi-agent critique to improve my code/writing/design", "Run a propose-critique-revise loop on this", "Help me evaluate and improve this where there's no single right answer", "Self-play review this design decision", "Meta-evaluate these critiques".
Determine the convexity of arbitrarily deep symbolic function compositions using AST decomposition and recursive DCP-rule reasoning. Use when the user asks to 'check if this function is convex', 'analyze convexity of a composed expression', 'verify DCP compliance', 'classify a nested objective as convex/concave/neither', 'decompose a composite function for convexity analysis', or 'apply disciplined convex programming rules'.
Detect and prevent semantic collapse in LLM outputs — where models fabricate spurious relationships between unrelated concepts. Apply CORE-style ontological relation evaluation to audit code, data pipelines, knowledge graphs, and AI systems for unrelatedness reasoning failures. Use when: 'check if these concepts are actually related', 'audit my ontology for spurious relations', 'evaluate semantic relationships in my knowledge graph', 'detect hallucinated connections in LLM output', 'validate entity relationships in my schema', 'test unrelatedness reasoning in my AI pipeline'.
Multi-agent security vulnerability discovery and exploitation using Co-RedTeam's orchestrated workflow. Decomposes security analysis into coordinated discovery and exploitation stages with execution-grounded iterative reasoning and layered memory. Use when: 'find vulnerabilities in this codebase', 'red team this application', 'security audit this project', 'exploit this vulnerability', 'penetration test this service', 'analyze this code for security flaws'.
Corpus-level QA over massive document collections using memory-augmented agentic processing. Synthesize answers that require global integration, comparison, and statistical aggregation across hundreds of documents. Use when: 'analyze all these documents and answer...', 'compare metrics across this corpus', 'aggregate statistics from these reports', 'what patterns exist across all files in...', 'summarize findings across the entire dataset', 'rank entities by computed metrics from these documents'.
Guides cost-aware model selection for text classification pipelines, applying multi-objective trade-off analysis (F1 vs cost vs latency) to choose between fine-tuned encoders (BERT/RoBERTa/DistilBERT) and LLM prompting (GPT-4o/Claude). Uses Pareto frontier analysis and a parameterized utility function to recommend the right model for a given deployment regime. Trigger phrases: "Which model should I use for text classification?", "Is GPT-4o overkill for my classification task?", "Help me pick a cost-effective NLP model", "Compare BERT vs LLM for classification cost", "Optimize my text classification pipeline for production", "Build a cost-aware NLP system".
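The Pareto frontier and parameterized utility mentioned above could look roughly like the following. This is a sketch under stated assumptions: the `Candidate` fields, the linear form of `utility`, and its weight parameters are hypothetical choices for illustration, not the skill's actual definitions.

```python
from typing import List, NamedTuple


class Candidate(NamedTuple):
    """One model option; field names and units are illustrative assumptions."""
    name: str
    f1: float          # higher is better
    cost: float        # lower is better (for example, dollars per 1k requests)
    latency_ms: float  # lower is better


def dominates(a: Candidate, b: Candidate) -> bool:
    """True if `a` is at least as good as `b` on every axis and strictly better on one."""
    no_worse = a.f1 >= b.f1 and a.cost <= b.cost and a.latency_ms <= b.latency_ms
    better = a.f1 > b.f1 or a.cost < b.cost or a.latency_ms < b.latency_ms
    return no_worse and better


def pareto_frontier(candidates: List[Candidate]) -> List[Candidate]:
    """Keep only the candidates that no other candidate dominates."""
    return [c for c in candidates
            if not any(dominates(other, c) for other in candidates)]


def utility(c: Candidate, w_f1: float = 1.0, w_cost: float = 0.0,
            w_latency: float = 0.0) -> float:
    """Assumed linear utility: reward F1, penalize cost and latency by the given weights."""
    return w_f1 * c.f1 - w_cost * c.cost - w_latency * c.latency_ms
```

Filtering to the frontier first and only then ranking by utility keeps the deployment-specific weights from hiding options that are strictly worse on every axis.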
Boost Android app test coverage beyond the 30% activity ceiling using agentic static analysis of Smali code, component transition graphs, activation condition inference, and Frida dynamic instrumentation script generation. Use when: 'increase Android app test coverage', 'generate Frida scripts for unreachable activities', 'analyze APK activation conditions', 'break through coverage barriers in mobile testing', 'instrument Android app for GUI fuzzing', 'find unreachable activities in Android app'.
Detect and fix cross-lingual evaluation instabilities in LLM-as-a-judge pipelines. Use when: 'audit my multilingual eval pipeline', 'check if my LLM judge is stable across languages', 'set up cross-lingual evaluation', 'calibrate judge scoring for non-English languages', 'diagnose ranking inversions in multilingual benchmarks', 'build controlled generation tests for eval reliability'.
Build reusable, parameterized skill libraries for computer-using agents (CUAs). Decomposes GUI automation into Skill Cells (intent), Parameterized Execution Graphs (actions), and Skill Composition Graphs (chaining). Use when: 'build a skill library for desktop automation', 'create reusable GUI action primitives', 'design a computer-using agent with skill retrieval', 'structure browser automation as composable skills', 'add failure recovery to my GUI agent', 'make my automation agent learn from past failures'.
Generate and evaluate culturally-grounded LLM personas using World Values Survey variables, Inglehart-Welzel Cultural Map positioning, and Moral Foundations Theory profiling. Use this skill when the user asks to: "create culturally diverse personas for user testing", "build a persona pipeline grounded in real survey data", "evaluate cultural bias in LLM-generated personas", "map synthetic personas onto the Inglehart-Welzel cultural map", "generate moral profiles for different cultural configurations", "simulate culturally conditioned survey responses".
Implements a curiosity-driven knowledge retrieval framework for autonomous agents. Formalizes agent uncertainty as a curiosity score, triggers external knowledge retrieval when uncertainty exceeds a threshold, and organizes retrieved knowledge into structured AppCards for selective integration into reasoning. Trigger phrases: "build an agent with curiosity-driven retrieval", "add uncertainty-aware knowledge lookup", "implement AppCard knowledge system", "create a curiosity-scored agent pipeline", "build adaptive knowledge retrieval", "implement uncertainty-triggered documentation lookup"
Generate difficulty-aware unit tests that verify LLM-generated code using branch coverage analysis, complexity-weighted rewards, and majority voting selection. Use when asked to 'verify generated code', 'write tests for hard branches', 'difficulty-aware testing', 'rank code solutions with tests', 'generate verification tests', or 'test code with branch coverage'.
Build multi-agent pipelines that transform CVE metadata into fully executable vulnerability reproduction environments with Docker, automated tests, and verified patches. Use this skill when: "Set up a CVE reproduction environment", "Create an executable security task from a CVE", "Build a vulnerability benchmark with Docker", "Reproduce CVE-2025-XXXXX in an isolated container", "Generate exploit tests and patch verification for a vulnerability", "Design a multi-agent pipeline for security task automation".
Apply Theory of Mind-based strategic persuasion to code reviews, PR rebuttals, RFC objections, and technical disagreements. Uses the ToM-Strategy-Response (TSR) pipeline from RebuttalAgent to model the reviewer's mental state, formulate a targeted persuasion strategy, and generate a grounded response. Trigger phrases: "Help me respond to this code review", "Write a rebuttal for this PR feedback", "How should I push back on this review comment", "Draft a response to this RFC objection", "Help me persuade the reviewer", "Respond to this technical critique".
Automate data recipe generation for LLM fine-tuning and adaptation. Generates executable data processing pipelines (filtering, synthesis, mixing, augmentation) that transform raw data sources into optimized training corpora for a target task. Trigger phrases: 'create a data recipe', 'optimize training data mix', 'build a data pipeline for fine-tuning', 'curate training data for task X', 'generate a data processing pipeline', 'mix datasets for domain adaptation'.
Cross-modal data analysis agent that unifies structured sources (SQL, CSV, JSON) with unstructured visual documents (scanned PDFs, invoice images, chart screenshots) using divide-and-conquer sub-agents and iterative code generation. Triggers: 'analyze data from these mixed sources', 'combine CSV with scanned PDF table', 'extract table from image and join with database', 'cross-modal data analysis', 'zombie data activation', 'analyze heterogeneous data files together'
Debug code by mentally simulating execution as a Code World Model — predicting runtime state after each statement, catching failures from token-budget exhaustion and string tokenization brittleness, and isolating whether bugs come from incorrect action generation or state propagation errors. Use when: 'trace through this code step by step', 'why does this function return the wrong value', 'debug this execution trace', 'simulate what happens when I run this', 'find where the state goes wrong', 'step through the variables in this loop'.
Analyze and optimize LLM reasoning token efficiency using a multiplicative decomposition framework. Breaks down reasoning performance into completion rate, conditional correctness, verbosity, verbalization overhead, and coupling coefficient. Identifies bottleneck profiles and suggests targeted interventions. Includes trace-quality diagnostics (grounding, repetition, prompt copying). Trigger phrases: "analyze reasoning efficiency", "decompose token usage", "why is this model so verbose", "token budget analysis", "reasoning trace quality", "optimize reasoning tokens"
Disentangled structure-content reasoning for table images and structured data. Separates table skeleton (layout/structure) from flesh (cell content) to answer questions accurately. Use when: 'analyze this table image', 'answer questions about this spreadsheet screenshot', 'extract data from this table photo', 'reason over this financial table', 'compare values in this table image', 'what does this table show'.
Rerank retrieved passages for RAG pipelines using step-by-step logical reasoning to filter out semantically similar but logically irrelevant (SSLI) documents. Use when: 'rerank these search results', 'filter irrelevant passages from retrieval', 'build a scientific QA pipeline', 'improve RAG answer quality', 'passages look relevant but answers are wrong', 'reduce hallucinations in retrieval-augmented generation'.
Iteratively evolve code solutions using momentum-driven semantic deltas instead of full-code histories. Use when: 'evolve a better heuristic for bin packing', 'optimize this algorithm iteratively', 'use LLM-driven evolution to improve this function', 'find a better solution through evolutionary search', 'iteratively refine this solver', 'apply DeltaEvolve to discover a better algorithm'.
Solve long-horizon planning tasks with verifiable constraints using the DeepPlanning methodology: proactive information gathering, local constraint reasoning, and global constrained optimization. Use when asked to 'plan a multi-day trip with budget constraints', 'build a shopping optimizer with coupons and sizing', 'create an agent that handles complex multi-step planning', 'design a constraint-satisfaction planner', 'optimize across interdependent decisions with budgets', or 'build a planning benchmark with verifiable solutions'.
Conduct deep, multi-step research on complex topics using Sequential Plan Refinement with Reflection and Candidates Crossover. Maintains a Global Research Context across iterations so each search step builds on prior findings, avoids redundancy, and adapts the plan at runtime. Use this skill when: "research this topic in depth", "write a comprehensive report on", "deep dive into", "investigate and synthesize findings on", "generate a research report about", "analyze this complex topic thoroughly".
Dependency-aware multi-step reasoning with persistent memory for complex questions requiring information retrieval across multiple sources. Use when: 'answer this multi-hop question', 'research this topic step by step', 'find information that depends on other lookups', 'break down this complex question', 'trace the reasoning chain for this query', 'search and synthesize across multiple sources'.
Detect and auto-correct hallucinated API calls in LLM-generated Python code using deterministic AST analysis and library introspection. Validates function signatures, parameter names, method existence, and identifier references against a dynamically-built Knowledge Base of real API specs. Use when: "check this generated code for hallucinated APIs", "validate these API calls are real", "fix the fake parameters in this code", "verify this code against actual library signatures", "detect hallucinations in this Python snippet", "auto-correct wrong API usage in generated code".
Apply the DevOps-Gym methodology to systematically tackle full-cycle DevOps tasks for Java and Go projects: build and configuration repair, runtime monitoring and anomaly detection, issue resolution via code patches, and regression test generation. Trigger phrases: 'fix this build failure', 'diagnose this runtime anomaly', 'generate regression tests for this bug', 'resolve this issue in Java/Go', 'debug this CI pipeline', 'monitor this running service for anomalies'.
Build self-evolving multi-agent systems that accumulate tool-level expertise through structured interaction without model fine-tuning. Uses GeoEvolver's architecture: retrieval-augmented orchestration, parallel sub-goal exploration, contrastive memory distillation, and root-cause failure attribution. Triggers: 'build a self-evolving agent pipeline', 'create an experience-driven multi-agent system', 'add memory to my agent workflow', 'implement tool exploration with failure learning', 'make agents learn from execution history', 'build a GeoEvolver-style system'
Extract high-level semantic patterns from fine-grained simulation or event logs using LM-guided program synthesis. Transforms raw numerical traces into annotated pattern timelines, then composes reward/query programs from natural language goals. Use when: 'analyze simulation logs for patterns', 'find high-level events in trace data', 'generate reward functions from goals', 'summarize physics simulation output', 'build pattern detectors for time-series logs', 'annotate event traces with semantic labels'.
Diversity-enhanced RAG for open-ended queries with multiple valid answers. Uses reflection-guided generation and memory-augmented iterative refinement to produce diverse, high-quality responses instead of collapsing to a single dominant answer. Triggers: 'give me diverse perspectives on', 'explore different viewpoints', 'brainstorm multiple approaches to', 'what are the different ways to', 'open-ended search with diversity', 'DIVERGE-style RAG search'
Information-theoretic question-asking framework for disambiguating user intent through structured yes/no questions. Uses a three-agent architecture (Seeker, Oracle, Pruner) grounded in Shannon entropy to maximize information gain per question turn. Trigger phrases: 'clarify ambiguous request', 'ask better questions', 'narrow down requirements', 'disambiguate user intent', 'information-gathering dialogue', 'structured requirements elicitation'.
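The entropy-based question selection described above can be illustrated with a short calculation. This is a sketch, assuming a finite set of candidate intents with known prior probabilities and an oracle that always answers truthfully; the function names and the hypothesis-set representation are hypothetical, and the framework's own scoring may differ.

```python
import math


def entropy(probs):
    """Shannon entropy in bits of a probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def expected_information_gain(prior, yes_set):
    """Expected entropy reduction from asking one yes/no question.

    prior:   dict mapping each candidate hypothesis to its probability.
    yes_set: hypotheses for which the truthful answer would be "yes".
    """
    yes_probs = [p for h, p in prior.items() if h in yes_set]
    no_probs = [p for h, p in prior.items() if h not in yes_set]
    h_before = entropy(prior.values())
    h_after = 0.0
    for branch in (yes_probs, no_probs):
        mass = sum(branch)
        if mass > 0:
            # Entropy of the renormalized survivors, weighted by branch probability.
            h_after += mass * entropy(p / mass for p in branch)
    return h_before - h_after
```

Under these assumptions a question that splits the probability mass evenly yields the maximum of one bit, which is why the selection step favors balanced partitions of the remaining candidates.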
Optimize LLM context length for post-editing and refinement pipelines. Applies research showing that naively adding document-level context often fails to improve LLM output quality while dramatically increasing cost and latency. Use when: 'optimize my translation post-editing pipeline', 'reduce LLM API costs for text refinement', 'should I use full document context for editing', 'build an automatic post-editing system', 'design a cost-efficient LLM correction pipeline', 'evaluate whether longer context helps my LLM task'.
Evaluate and defend RAG-based code generation systems against energy-drain attacks that poison retrieval contexts to inflate LLM output length, latency, and GPU energy consumption. Use when: 'audit my RAG pipeline for energy attacks', 'test code retrieval poisoning resilience', 'detect adversarial triggers in retrieved code', 'harden my code generation system against context poisoning', 'benchmark energy cost of retrieval-augmented code generation', 'simulate DrainCode-style attacks on my pipeline'.
Process extremely long documents and contexts by compressing them into memory chunks, selectively retrieving relevant blocks via a gating mechanism, and reasoning iteratively with working memory. Based on the CogMem framework (arXiv:2602.08382). Use this skill when the user says: "Analyze this huge codebase and answer questions about it", "Process this long document and find the relevant sections", "Reason over multiple files to trace a bug", "Summarize and query across a 100K+ token context", "Handle this context that's too long to fit in one pass", "Multi-hop reasoning over a large repository".
Diagnose and mitigate feedback-loop risks (bias amplification, hallucination propagation, exposure polarization) in LLM-powered recommender systems using a role-aware, phase-wise diagnostic framework. Use when: 'audit my recommendation pipeline for bias', 'check feedback loop risks in my LLM recommender', 'diagnose hallucination propagation in recommendations', 'build a feedback-loop simulation for my recommender', 'trace popularity bias through my recommendation cycles', 'add risk monitoring to my LLM-based ranking system'.
Build evaluation pipelines for LLMs in legal tasks using a three-dimensional framework: outcome correctness, reasoning reliability, and trustworthiness. Use when asked to 'evaluate LLM legal performance', 'build a legal benchmark', 'test legal reasoning quality', 'audit LLM fairness in judicial tasks', 'create a legal eval suite', or 'assess LLM trustworthiness for law'.
Build TabRAG-style pipelines that retrieve relevant tables from large image collections and answer natural language queries over them using multimodal LLMs. Implements a three-stage retrieve-rerank-reason architecture for table question answering at scale. Trigger phrases: "find the right table and answer my question", "search across table images to answer a query", "build a table retrieval pipeline", "RAG over table images", "table QA from document scans", "retrieve and reason over tabular data".
Build multi-agent emotion-focused therapy (EFT) reasoning pipelines for empathetic mental health Q&A systems. Uses a bottom-up three-stage chain-of-thought: Embodied Perception, Cognitive Exploration, and Narrative Intervention with eight specialized agents. Trigger phrases: 'build an EFT chatbot', 'emotion-focused therapy agent', 'empathetic counseling system', 'multi-agent mental health pipeline', 'somatic-aware therapy bot', 'EFT-CoT reasoning chain'.
Applies Empirical-MCTS dual-loop reasoning: structured tree search with persistent memory that accumulates experience across problems. Use when asked to 'solve this step by step with learning', 'use MCTS reasoning', 'try multiple approaches and remember what works', 'evolve your strategy as you go', 'search for the best solution systematically', or 'use tree search with memory'.
Build trust-aware multi-agent systems using Epistemic Context Learning (ECL). Constructs peer reliability profiles from interaction history so agents weight information by source credibility instead of blindly conforming. Use when: 'build a multi-agent pipeline with trust', 'add peer reliability tracking to my agent system', 'prevent sycophancy in my LLM agents', 'implement history-aware trust for multi-agent', 'make agents evaluate peer credibility', 'add epistemic context to agent orchestration'.
Optimize LLM prompts by systematically collecting errors, building a taxonomy of failure modes, and augmenting prompts with targeted guidance for the most frequent error categories. Based on ETGPO (Singh et al., 2026). Trigger phrases: "optimize this prompt", "my prompt keeps failing", "improve prompt accuracy", "debug prompt errors", "fix LLM failures", "prompt isn't working well"
Perform DAG-structured vulnerability reasoning on code, modeling causal dependencies between code facts instead of linear chain-of-thought. Use when asked to: 'analyze this code for vulnerabilities', 'explain why this code is vulnerable', 'trace the root cause of this security bug', 'review this function for memory safety issues', 'is this code exploitable and why', 'reason about the security of this code path'.
Evaluate and mitigate social bias in RAG pipelines. Use when: 'audit my RAG system for bias', 'check if retrieval introduces stereotypes', 'measure fairness in my QA pipeline', 'reduce bias in LLM outputs with retrieval', 'evaluate social bias across demographic groups', 'bias-aware RAG system design'.
Build RAG pipelines that transform unstructured clinical or domain-specific documents into structured form records using a multi-layer data lake, hybrid relational+vector storage, and rule-driven adaptive forms. Trigger phrases: 'build a clinical document extraction pipeline', 'convert unstructured reports to structured forms', 'RAG pipeline for medical records', 'automate form completion from documents', 'extract structured data from clinical notes', 'build a tumor board automation system'.
Build hybrid demand forecasting systems that fuse LLM-extracted event knowledge with time-series models using a dual-tower architecture. Use when asked to: 'forecast demand for flash sales', 'build a demand prediction pipeline with event awareness', 'integrate promotional calendars into forecasting', 'predict sales spikes from campaigns or holidays', 'combine LLM reasoning with time-series forecasting', 'create an event-driven demand model'.
Build and evaluate long-term conversational memory systems for multi-party, multi-topic dialogues. Implements the EverMemBench framework for stress-testing memory architectures against realistic workplace conversation patterns with temporal evolution, cross-topic interleaving, and role-specific personas. Use when: 'build a memory system for multi-user chat', 'evaluate my RAG memory pipeline', 'benchmark long-term conversation recall', 'test memory across multi-party dialogues', 'design a temporal memory store for chat agents', 'audit retrieval quality for conversational AI'.
Self-evolving code generation with iterative reflection and revision. Applies a feedback-driven loop where code is submitted, judged, analyzed for failures, and rewritten up to 3 times — tracking correctness, runtime, memory, and algorithmic improvement at each iteration. Use when: 'solve this coding problem and optimize it', 'iteratively improve this solution', 'refine my code until it passes all tests', 'benchmark my solution against human performance', 'reduce the time complexity of this code', 'fix and re-attempt this failing solution'.
Autonomous environment configuration using multi-agent diagnosis and self-evolving error repair. Use when: 'set up the dev environment for this repo', 'configure dependencies and get tests passing', 'debug why my environment build fails', 'create a Dockerfile for this project', 'fix my broken Python environment', 'auto-configure this repository so tests run'.
Evolutionary search framework for LLM-driven Verilog/RTL generation and PPA optimization. Uses MCTS for functional correctness and Idea-Guided Refinement for optimization, with structured testbench generation for rapid feedback. Triggers: 'generate Verilog module', 'optimize RTL design', 'evolutionary search for hardware', 'MCTS Verilog generation', 'reduce PPA for circuit', 'fix Verilog functional correctness'.
Implement biologically-inspired forgetting mechanisms for LLM agent memory systems. Build dual-layer memory hierarchies with adaptive exponential decay, semantic relevance scoring, and LLM-guided conflict resolution to keep agent context lean and high-quality. Use when: "add forgetting to my agent memory", "implement memory decay for my chatbot", "build an agent memory system with selective retention", "reduce memory bloat in my AI agent", "implement FadeMem-style memory management", "add adaptive memory consolidation to my agent".
Implement the Fat-Cat document-driven metacognitive agent architecture for complex multi-step reasoning tasks. Uses Markdown documents as global state instead of JSON, a four-stage reasoning pipeline (metacognitive analysis, strategy selection, step decomposition, execution), textual strategy evolution for accumulating task-solving knowledge, and a closed-loop watcher to prevent hallucinations and infinite loops. Trigger phrases: "use fat-cat for this task", "document-driven agent", "metacognitive reasoning pipeline", "markdown state management", "multi-agent with strategy evolution", "fat-cat agent workflow".
Extract feature-level coding tasks from repositories using test-driven dependency graph tracing. Use when the user says 'create a feature benchmark', 'extract coding tasks from tests', 'build a test-driven benchmark', 'evaluate agent coding ability', 'isolate features from a codebase', or 'generate feature development tasks'.
Analyze SEC filings and financial disclosures using the Fin-RATE three-pathway methodology: detail-oriented reasoning within single documents, cross-entity comparison across companies, and longitudinal tracking across reporting periods. Includes structured error diagnosis for retrieval, generation, reasoning, and context failures. Use when: 'analyze this 10-K filing', 'compare revenue across these companies', 'track this firm's risk factors over time', 'build a financial QA pipeline over SEC filings', 'evaluate RAG accuracy on regulatory documents', 'diagnose why my financial QA system hallucinates'.
Extract security-relevant signals from code review comments and translate them into fuzzer-guiding annotations using the EyeQ pipeline. Use when the user says 'guide fuzzing from code reviews', 'find dragons in review comments', 'annotate code for fuzzing', 'review-guided fuzzing', 'extract security signals from PRs', or 'instrument code for AFL++ from review discussions'.
Detect and defend against fraudulent content in LLM inputs using knowledge-graph-augmented analysis. Builds a fraud tactic-keyword bipartite graph, scores associations by confidence, prunes ambiguities, and augments prompts with XML-tagged keywords plus evidence rationales. Use when: 'check this email for fraud', 'is this job posting a scam', 'analyze this contract for manipulation', 'detect phishing in this message', 'flag suspicious text in this document', 'add fraud detection to my LLM pipeline'.
Proactively identify security-critical code regions and generate prevention-oriented explanations before vulnerabilities are introduced. Use when: 'review this code for security-critical areas', 'explain security risks in my methods', 'find security-sensitive code before bugs happen', 'proactive security review of my codebase', 'highlight authentication and data access risks', 'prevent vulnerabilities in this module'.
Diagnose and mitigate Toxic Proactivity in LLM agent systems -- the failure mode where agents override ethical constraints to maximize helpfulness. Use when building agent evaluations, designing safety benchmarks, auditing multi-step agent behavior, or hardening agent guardrails. Triggers: 'detect toxic proactivity', 'agent safety audit', 'test agent misalignment', 'evaluate agent behavioral safety', 'dilemma-driven agent testing', 'agent over-helpfulness risk'.
Agentic High-Level Synthesis (HLS) optimization: autonomously analyze, insert, and tune C/C++ HLS pragmas (pipeline, unroll, array_partition, dataflow) through closed-loop feedback with synthesis tools. Use when: 'optimize this HLS kernel', 'add pragmas to this C++ for FPGA', 'explore the design space for this Vitis HLS project', 'tune my hardware accelerator', 'analyze this HLS synthesis report', 'debug why my HLS design has poor throughput'.
Design production-grade agentic AI architectures with separated cognition/execution layers, typed tool interfaces, multi-agent topologies, and enterprise hardening. Use when: 'design an agent system', 'build a multi-agent architecture', 'add governance to my AI pipeline', 'harden my LLM agent for production', 'create a tool registry for agents', 'architect agent-to-agent coordination'.
Build content moderation and safety classification systems using multi-attribute trajectory reasoning instead of binary labels. Decomposes monolithic safe/unsafe decisions into structured reasoning chains (evidence grounding, modality assessment, risk mapping, policy decision, response generation) with multi-head reward scoring. Use when asked to: 'build a content moderation pipeline', 'classify harmful content with explanations', 'create a safety filter with reasoning traces', 'design a multi-attribute content scorer', 'implement explainable content moderation', 'add dense safety reasoning to a classifier'.
Build production-grade full-stack web applications using a three-agent pipeline (Planning, Backend, Frontend) with development-oriented testing and structured debugging. Triggers: 'build a full-stack app', 'create a web app with backend and database', 'full-stack website with API and database', 'build me a CRUD app', 'create a web application with user authentication and data storage', 'scaffold a complete web project with frontend and backend'.
Fine-grained multi-concept text control that avoids the compositionality trap where LLMs degrade when asked to be e.g. funny AND persuasive simultaneously. Use when: 'write a funny persuasive email', 'make this formal but warm', 'generate humorous and convincing copy', 'control tone on two axes at once', 'blend humor with authority in this text', 'write polite but assertive feedback'.
Generate high-quality code by decomposing solutions into modular functions (Chain-of-Function style), then self-evaluating each function as a discrete reasoning step to select the best candidate. Triggers: 'solve this coding problem', 'generate modular code', 'write clean functions for this algorithm', 'best-of-N code generation', 'help me with this competitive programming problem', 'generate multiple solution candidates'.
Build and run graph-based multi-agent adversarial simulations using the GAMMS framework. Covers agent creation, graph environments (including OpenStreetMap road networks), sensor configuration, adversarial rules, potential-field strategies, recording/replay, and integration with ML libraries. Trigger phrases: "simulate multi-agent on a graph", "GAMMS simulation", "adversarial agent modeling", "graph-based agent environment", "multi-agent road network simulation", "build a capture-the-flag simulation".
Detect sarcasm and semantic incongruity in multimodal (image+text) content using the GDCNet three-channel discrepancy comparison approach. Generates objective image descriptions as semantic anchors, then computes semantic, sentiment, and fidelity discrepancies against the original text to surface contradictions that signal sarcasm, irony, or misleading content. Trigger phrases: "detect sarcasm in this image and text", "check if this social media post is sarcastic", "find contradictions between the image and caption", "build a multimodal sarcasm detector", "analyze image-text incongruity", "implement discrepancy-based sarcasm detection".
Build granular error taxonomies from incorrect reasoning traces, then use those rubrics to detect errors in LLM outputs across technical domains. Use when asked to: 'build a rubric for evaluating code solutions', 'create an error taxonomy for math reasoning', 'grade reasoning traces for correctness', 'build a reward function for domain-specific tasks', 'classify errors in chain-of-thought outputs', 'evaluate LLM reasoning without gold labels'.
Optimize LLM prompts using GFlowPO's iterative generate-evaluate-refine loop with diversity-preserving exploration and dynamic memory. Use when: 'optimize this prompt', 'find a better prompt for this task', 'prompt engineering with examples', 'auto-tune my system prompt', 'improve prompt accuracy', 'generate prompt variations'.
Build structured information-seeking agents that decompose complex queries into multi-turn search-and-browse workflows, aggregate results from multiple web sources, and return answers in typed structured formats (items, sets, lists, tables). Applies the GISA benchmark's ReAct-based agent architecture and evaluation methodology. Trigger phrases: "build an information-seeking agent", "search agent pipeline", "multi-turn web research agent", "structured web search workflow", "aggregate information from multiple sources", "web research with structured output".
Design and orchestrate multi-agent systems using reusable Agent Primitives (Review, Voting/Selection, Planning/Execution) that compose into task-specific pipelines. Use when asked to: 'build a multi-agent workflow', 'create an agent pipeline for this task', 'set up agents to review and refine output', 'orchestrate parallel agent voting', 'decompose this into a planning and execution pipeline', 'design a reusable agent architecture'.
Interactive debugging of LLM-powered software development agents using structured trajectory analysis, stepwise execution, and live editing of prompts/tool calls. Use when: 'debug my agent', 'why did the agent do that', 'trace agent execution', 'step through agent actions', 'inspect agent trajectory', 'agent is producing wrong output'.
Implement structured, multi-surface observability logging for LLM agent systems using the AgentTrace pattern: operational, cognitive, and contextual trace surfaces with unified envelopes, span hierarchies, and dual-path storage. Use when asked to: 'add observability to my agent', 'log agent reasoning traces', 'instrument LLM tool calls', 'build agent telemetry', 'trace agent decisions', 'monitor agent security'.
Build LLM agent systems hardened against indirect prompt injection using hierarchical memory isolation, schema-validated return values, and event-triggered sanitization inspired by OS process isolation. Use this skill when a user asks to "build a secure agent pipeline", "protect agents from prompt injection", "isolate agent memory", "design a multi-agent system with security boundaries", "implement safe tool calling for LLM agents", or "add injection defense to an agent framework".
Implement available-but-invisible privacy protection for mobile GUI agents using PII-aware anonymization with deterministic, type-preserving placeholders. Use when: 'anonymize PII in UI automation', 'build privacy layer for mobile agent', 'protect sensitive data in screenshots', 'add PII detection to Android agent', 'type-preserving placeholder system', 'privacy-safe GUI agent pipeline'.
Audit Ethereum smart contracts for business logic vulnerabilities using Heimdallr's four-phase agentic pipeline: function-level code reorganization via dependency graph clustering, heuristic Plan-Remind-Solve reasoning with adversarial state injection, automatic multi-step exploit chaining, and cascaded false-positive filtration. Trigger phrases: 'audit this smart contract', 'find vulnerabilities in this Solidity code', 'check this DeFi protocol for exploits', 'smart contract security review', 'detect business logic bugs in this contract', 'chain exploit paths in this protocol'.
Evaluate and generate BPMN process models from natural language using the BEF4LLM framework. Assess BPMN XML quality across syntactic, pragmatic, semantic, and validity dimensions. Triggers: 'generate BPMN from description', 'evaluate BPMN model quality', 'convert process description to BPMN XML', 'assess business process model', 'validate BPMN diagram', 'text to BPMN'.
Generate high-quality multi-step LLM workflows using Bayesian inference with parallel look-ahead rollouts and importance-weighted resampling. Use when: 'build a workflow for this task', 'generate an agent pipeline', 'create a multi-step LLM chain', 'optimize my prompt chain', 'design an agentic workflow', 'Bayesian workflow generation'.
Design and implement neuro-symbolic business automation systems that combine LLM agents with predicate-logic programming and knowledge graphs. Applies the AUTOBUS architecture: tasks modeled as networks with pre/post conditions, enterprise data as logic facts, and AI agents that generate executable logic programs from natural-language instructions. Trigger phrases: "Build a business process automation system", "Create a neuro-symbolic workflow engine", "Design a task orchestration system with logic constraints", "Model business processes as knowledge graphs with AI agents", "Implement an enterprise system that combines LLMs with rule engines", "Build a system where AI agents generate executable business logic".
Implement uncertainty quantification and calibration assessment for LLM-generated long-form answers. Apply answer-frequency consistency, verbalized confidence elicitation, token-level analysis, and multi-metric calibration benchmarking based on the UQ framework from Müller et al. (2026). Trigger phrases: "measure how confident the model is in this answer", "calibrate uncertainty on these QA results", "benchmark uncertainty quantification for my LLM pipeline", "which uncertainty method should I use for scientific QA", "detect unreliable LLM answers", "evaluate calibration of model confidence scores".
Detect phishing URLs using LLM zero-shot and few-shot prompting with structured classification prompts. Use when: 'classify this URL as phishing or legitimate', 'analyze URLs for phishing', 'build a phishing detection prompt', 'detect suspicious URLs with few-shot examples', 'benchmark phishing detection accuracy', 'zero-shot URL security classification'.
Build self-managing data processing pipelines using hierarchical meta-agent orchestration. Decomposes complex data tasks into multi-phase plans, instantiates specialized ground-level agents (Reader, Profiler, Transformer, Validator, Joiner, etc.), and iteratively refines execution through progressive sampling and monitoring loops. Triggers: 'build a data pipeline', 'process and clean this dataset', 'create an ETL pipeline with agents', 'orchestrate data processing agents', 'autonomous data pipeline', 'meta-agent data processing'.
Generate syntactically correct and functional ABAP code using iterative compiler feedback loops. Applies the empirical methodology from Wallraven et al. (2026) to produce SAP ABAP classes that pass syntax checks and unit tests through up to 5 rounds of error-driven refinement. Trigger phrases: 'generate ABAP code', 'write ABAP class', 'fix ABAP syntax error', 'ABAP compiler feedback', 'SAP ABAP development', 'iterative ABAP correction'.
Generate correct Python/Pandas code from natural language questions over tabular data, applying the Logic Completion Framework (LCF) to resolve ambiguities that SQL handles implicitly. Use when: 'query this CSV with Python', 'convert this SQL to pandas', 'answer this question from the database using Python', 'write pandas code for this data question', 'translate SQL query to Python', 'analyze this table and answer questions about it'.
Apply BatCoder's back-translation technique to improve code and documentation quality bidirectionally. Generate documentation from code, then verify it by reconstructing code from that documentation -- using cycle consistency as a quality signal. Triggers: 'generate documentation from code', 'back-translate code and docs', 'verify documentation quality', 'improve code with back-translation', 'self-supervised code documentation', 'cycle-consistent code generation'.
Apply daVinci-Dev's agent-native workflow to software engineering tasks: navigate repos, localize bugs, plan edits, apply structured patches, and verify with tests. Use when asked to 'fix this issue in the repo', 'resolve this bug across files', 'apply agentic SWE workflow', 'navigate and patch this codebase', 'use agent-native approach to solve this', or 'debug and fix with test verification'.
Constrain LLM generation with executable Pydantic schemas and multi-agent pipelines to produce structurally valid, domain-rich artifacts. Uses ontology-as-grammar to eliminate hallucinated structures while preserving creative output. Trigger phrases: "generate a valid game design", "schema-constrained generation", "build a multi-agent pipeline with Pydantic validation", "ontology-driven content generation", "structured creative generation with DSPy", "generate artifacts that pass domain validation".
Agentic game development with visual feedback loops for Godot Engine projects. Applies the GameDevBench methodology: navigating scene hierarchies, editing multimodal assets (sprites, shaders, animations), and using screenshot/video feedback to verify changes visually. Trigger phrases: 'build a game in Godot', 'fix my Godot scene', 'add animation to my game character', 'edit this shader effect', 'set up sprite animations from a spritesheet', 'create a game UI with Godot'.
Set up and use DART (Diffusion-Inspired Speculative Decoding) for fast LLM inference. DART replaces autoregressive draft models with parallel masked-position prediction using a single transformer layer, combined with N-gram-enforced tree pruning. Triggers: 'speed up LLM inference with DART', 'set up speculative decoding with DART', 'integrate DART for faster generation', 'configure DART draft model', 'compare DART vs EAGLE3 speculative decoding', 'optimize LLM serving latency with parallel drafting'.
Build LLM agent workflows that stay robust under partial observability, noisy signals, shifting environments, and internal state drift. Applies the four-stressor robustness framework from Pezeshkpour & Hruschka (2026) to real automation pipelines. Use when: 'make this agent more robust', 'handle unreliable API responses', 'add fallback logic to my pipeline', 'my agent breaks when the environment changes', 'add verification steps to my workflow', 'build a fault-tolerant automation'.
Build systems that use LLM uncertainty as an active control signal -- routing computation, triggering tool calls, enabling self-correction, and governing agent decisions. Use when asked to: 'add uncertainty-aware routing to my LLM pipeline', 'make my agent decide when to use tools based on confidence', 'implement adaptive compute for reasoning chains', 'add self-correction when the model is unsure', 'build a confidence-gated retrieval system', 'prevent reward hacking in my RLHF loop'.
Generate LLM-assisted coding labs that teach concepts through 'Vibe Coding' — producing working code paired with mandatory conceptual reflection, prompt logging, and critical thinking assessments. Use when: 'create a vibe coding lab for transformers', 'design an NLP exercise with reflection questions', 'build a concept-focused coding tutorial', 'generate a lab that teaches X concept not just syntax', 'create a coding assignment with critical reflection', 'design a learn-by-prompting exercise'.
Build multi-agent systems for end-to-end ontology curation from scientific literature. Applies FlyAOC's agent architecture patterns—memorization, pipeline, single-agent, and multi-agent—to extract structured, ontology-grounded annotations from document corpora. Use when asked to: 'curate knowledge from papers into a structured ontology', 'build an agent pipeline for scientific literature extraction', 'design a multi-agent system for document annotation', 'extract Gene Ontology or controlled-vocabulary terms from text', 'reconcile evidence across multiple documents into structured annotations', 'build a retrieval-augmented scientific reasoning system'.
Predict and prevent recurring vulnerabilities in LLM-generated code using the FSTab (Feature-Security Table) technique. Maps observable frontend features to likely backend vulnerabilities based on which LLM generated the code. Use this skill when: "Audit this LLM-generated app for security vulnerabilities", "What vulnerabilities does this generated code likely have?", "Check this AI-generated backend for recurring security issues", "Build an FSTab mapping for this codebase", "Run a black-box vulnerability prediction on this web app", "Assess vulnerability persistence in code from GPT/Claude/Gemini".
Transform Claude from a static tool user into a dynamic tool creator using the UCT (User-to-Creator Transformation) framework. Harvests reasoning traces from problem-solving sessions and distills them into reusable utility functions, scripts, and helpers that grow a persistent tool library over time. Trigger phrases: 'create a reusable tool from this solution', 'build a helper I can reuse', 'evolve my toolset', 'extract a utility from this workflow', 'self-improving agent pipeline', 'turn this reasoning into a tool'.
Build production-grade natural language to SQL/API pipelines using RAG variant selection (Standard RAG, Self-RAG, CoRAG). Implements iterative query decomposition, hybrid documentation retrieval, and dynamic task classification for enterprise NL interfaces. Trigger phrases: "Build a natural language to SQL interface with RAG", "Generate API calls from user questions using retrieval augmented generation", "Set up a hybrid SQL and REST API generation pipeline", "Implement CoRAG for enterprise query generation", "Create a text-to-SQL system with document retrieval", "Design a retrieval pipeline that handles both database queries and API calls".
Instruction-guided code completion that follows user constraints on algorithm choice, data structures, control flow, and code scope. Use when: 'complete this function using a deque-based BFS', 'finish this code with exactly 3 lines', 'implement the sort using quicksort not mergesort', 'complete using recursion instead of iteration', 'fill in this block with a single for loop', 'generate the rest using dynamic programming'.
Perform code migrations (dependency upgrades, API changes, framework transitions) with integrated environment verification. Instead of migrating code then hoping it builds, this skill builds and tests inside a real environment at every step, using feedback loops to fix both code and configuration issues. Use when: 'migrate this project from X to Y', 'upgrade dependency version', 'port this codebase to a new framework', 'fix build after dependency update', 'help me upgrade NumPy/React/Django/Spring', 'automate this library migration'.
Structure long-horizon task planning using graph-based memory and bounded lookahead. Use when asked to: 'plan a multi-step agent workflow', 'build a task planner for a robot or agent', 'decompose complex goals into grounded actions', 'implement graph-based planning memory', 'create an agent that learns from past executions', 'build a planning system that avoids hallucinated actions'.
Apply Entropy-Guided Stepwise Scaling (EGSS) to complex software engineering tasks like bug fixing, code generation, and refactoring. Uses entropy-based uncertainty detection to selectively branch exploration at high-uncertainty decision points, then consolidates test suites across trajectories and uses multi-model voting to select the best patch. Trigger phrases: "use EGSS to fix this bug", "entropy-guided scaling", "stepwise scaling for this task", "try multiple approaches with EGSS", "scale test-time compute for this fix", "use adaptive branching to solve this".
Apply evidence-driven causal reasoning to compiler optimization pass selection and ordering. Uses the ECCO framework: analyze static code features, build causal explanations linking features to pass effectiveness, then guide search-based optimization with LLM-generated optimization intents. Triggers: "optimize compiler passes", "find best LLVM pass order", "reduce execution cycles for this code", "causal compiler optimization", "why does this optimization pass help", "tune LLVM pass sequence for performance".
Dynamically assign specialized roles to multiple AI agents via a meta-debate protocol (proposal + peer review) before running the actual task. Use when: 'set up a multi-agent debate', 'assign agent roles dynamically', 'which model should handle which role', 'run a meta-debate to pick agents', 'optimize role assignment for my agent swarm', 'capability-aware agent selection'.
Build AI-moderated collaborative learning platforms with LLM-driven discussion facilitation, adaptive feedback, and participation balancing. Use when: 'build a collaborative learning app', 'create an AI discussion moderator', 'add adaptive feedback to my education platform', 'implement a group discussion system with AI', 'build a real-time classroom discussion tool', 'add participation balancing to my learning platform'.
Write high-performance Triton GPU kernels using Dr. Kernel's multi-turn refinement strategy: profile-guided optimization, reward hacking prevention, and iterative kernel fusion. Use when asked to 'write a Triton kernel', 'optimize GPU code', 'fuse CUDA operations', 'speed up a PyTorch operation with Triton', 'generate a fast kernel for [operation]', or 'profile and optimize this kernel'.
Build reliable Docker environments for arbitrary code repositories using an agentic, multi-phase approach with dependency reasoning, loop detection, and cross-task success memory. Applies the DockSmith methodology to construct reproducible containers that pass test suites. Trigger phrases: "dockerize this repo", "build a Docker environment for this project", "create a reliable Dockerfile", "set up a containerized dev environment", "fix my Docker build failures", "make this repo's tests run in Docker".
Help users discover and form their intents through adaptive diverge-converge interaction, rather than just asking clarifying questions. Use when the user gives a vague or open-ended request such as 'build me a website', 'write something creative', 'make a CLI tool', 'design a component', 'help me refactor this', or 'create a visualization'.
Build production retrieval systems using pplx-embed, diffusion-pretrained dense and contextualized embedding models with INT8 quantization, late chunking for long documents, and multi-stage contrastive training. Use when: 'build a semantic search pipeline', 'set up document retrieval with contextual embeddings', 'implement late chunking for long documents', 'create a multilingual search index', 'optimize embedding storage with quantization', 'add contextualized passage retrieval to RAG'.
Structure-aware document reasoning that converts PDFs/long documents into hierarchically indexed paragraphs with coordinate metadata, then uses a dual-tool "locate then read" strategy (Retrieve + ReadSection) to answer complex questions requiring evidence scattered across distant document sections. Trigger phrases: "analyze this PDF and answer questions about it", "find evidence across sections of this document", "search this long document for specific information", "extract structured answers from this report", "answer questions about this paper/filing/manual", "build a document QA pipeline".
Implement hierarchical meta-cognitive monitoring for deep search agents. Embeds a two-tier self-monitoring system (fast consistency checks + slow experience-driven reflection) into multi-step retrieval-reasoning loops to detect anomalies, prevent reasoning drift, and trigger corrective interventions. Use when: 'build a deep search agent with self-monitoring', 'add metacognitive monitoring to my search pipeline', 'detect and fix reasoning failures in multi-step retrieval', 'implement DS-MCM for search quality', 'add anomaly detection to my RAG agent', 'build a self-correcting research agent'.
Detect malicious PyPI/npm packages using behavioral pattern mining and semantic reasoning (PyGuard). Use when: 'scan this package for malware', 'is this PyPI dependency safe', 'audit my requirements.txt for supply chain attacks', 'check this setup.py for suspicious behavior', 'analyze this npm package for data exfiltration', 'detect obfuscated malicious code in this package'.
Evaluate and select LLMs using CreditAudit's 2D framework: mean ability plus stability risk (fluctuation) across system prompt variations. Assigns credit grades (AAA–BBB) to models based on performance volatility. Use when: 'compare models for deployment', 'which LLM is most stable', 'evaluate model robustness to prompt changes', 'credit grade these models', 'model selection for agentic pipeline', 'rank models by reliability'.
Build cost-efficient RAG pipelines for entity matching and deduplication using blocking-based batch retrieval and generation. Reduces LLM API calls and latency by grouping similar entity pairs into blocks before retrieval and inference. Use when the user asks to 'match entities across datasets', 'deduplicate records with LLMs', 'build a RAG pipeline for entity resolution', 'reduce cost of LLM-based record matching', 'link records between two tables', or 'entity matching with knowledge augmentation'.
Confidence-guided self-refinement for adaptive reasoning. Implements the CoRefine pattern: assess confidence in each reasoning step, then decide whether to halt, re-examine, or try a different approach, reducing wasted compute while maintaining accuracy. Use when: 'solve this step by step with self-correction', 'refine your reasoning until confident', 'adaptively debug this problem', 'use confidence-guided refinement', 'self-correct with backtracking', 'try different approaches if stuck'.
Build multi-label hierarchical classifiers for MITRE ATT&CK text tagging using stage-wise classical ML (SGD-SVM + TF-IDF). Use when: 'tag CTI text with ATT&CK', 'classify threat reports with MITRE tactics', 'build hierarchical cybersecurity classifier', 'map CVE descriptions to ATT&CK techniques', 'automate MITRE tagging pipeline', 'multi-label threat classification'.
Generate high-reliability test suites without ground-truth implementations using the ConVerTest pipeline: Self-Consistency voting, Chain-of-Verification refinement, and Dual Execution Agreement. Use when asked to 'generate tests for this spec', 'write tests before implementation', 'create a test suite without reference code', 'test-driven development for this feature', 'generate reliable unit tests', or 'validate tests without a working implementation'.
Select and configure the right RAG strategy for conversational QA systems based on dataset characteristics. Use when: 'build a conversational RAG pipeline', 'choose a RAG method for multi-turn QA', 'my RAG pipeline performs worse than no retrieval', 'optimize retrieval for dialogue systems', 'compare RAG strategies for my dataset', 'reranking vs HyDE vs hybrid BM25'.
Build multilingual LLM evaluation arenas and preference data collection pipelines modeled on France's compar:IA platform. Collects human preference pairs for RLHF/DPO training in non-English languages using blind pairwise comparison, Bradley-Terry ranking, and privacy-preserving filtering. Trigger phrases: 'build an LLM arena', 'collect preference data for DPO', 'create a chatbot comparison platform', 'multilingual RLHF data pipeline', 'pairwise LLM evaluation system', 'French language model leaderboard'.
Build decentralized multi-agent coordination systems using LLM-based communication calibrated with conformal prediction. Agents share only statistically reliable messages, reducing noise and redundancy. Use when: 'coordinate multiple agents with LLM messaging', 'build a multi-agent system with calibrated communication', 'reduce noisy messages between cooperating agents', 'implement conformal prediction for agent communication filtering', 'design decentralized agent coordination without a central controller', 'filter unreliable LLM-generated messages in multi-agent pipelines'.
Render source code as images for vision LLM processing to reduce token cost while preserving understanding. Use when: 'render code as image for LLM', 'compress code tokens with images', 'use vision model for code understanding', 'reduce token cost for large codebase analysis', 'code image compression for clone detection', 'syntax highlighted code screenshot for VLM'.
Evaluate and improve LLM-generated hardware designs using ChipBench methodology: structured Verilog generation with hierarchical decomposition, systematic RTL debugging across four bug categories, and cross-language reference model generation. Use when: 'generate Verilog for this module', 'debug this RTL code', 'create a Python reference model for this hardware design', 'verify my Verilog against a reference implementation', 'benchmark my chip design workflow', 'translate this hardware spec to SystemC'.
Build context-summarized multi-turn QA systems that let small language models (SLMs) handle customer-service dialogues with near-LLM quality. Implements conversation history summarization, stage-based evaluation, and prompt engineering for resource-constrained deployments. Use when: 'build a customer service chatbot with a small model', 'summarize conversation history for context window', 'evaluate SLM vs LLM on multi-turn QA', 'deploy a multi-turn QA system on limited hardware', 'optimize dialogue context for small language models', 'stage-based analysis of chatbot performance'.
Optimize LLM system prompts iteratively using boundary-aware contrastive sampling and momentum-guided clustering from the C-MOP framework. Use when: 'optimize this prompt', 'improve my system prompt', 'evolve prompt for better accuracy', 'automatic prompt tuning', 'prompt optimization with examples', 'refine prompt using test cases'.
Iterative Dual-Phase Financial-PoT: decouple semantic reasoning from arithmetic computation to eliminate calculation errors in financial analysis. Use when: 'calculate financial ratios from reports', 'analyze annual report numbers', 'compute ROE/ROA from statements', 'extract and calculate metrics from financial data', 'why is my financial calculation wrong', 'build a financial analysis pipeline'.
Evaluate biomedical QA outputs using the BioACE nugget-based framework: assess answer completeness, correctness, precision, recall, and citation quality against ground-truth nuggets. Trigger phrases: "evaluate biomedical answers", "check citation quality for medical QA", "nugget-based evaluation of RAG output", "assess completeness and correctness of biomedical text", "verify biomedical citations with NLI", "score biomedical question answering output".
Automated black-box pipeline for detecting unverbalized biases in LLM decision-making. Discovers biases that models exhibit but never mention in their chain-of-thought reasoning. Use when: 'detect hidden biases in my LLM', 'audit model fairness', 'find unverbalized biases', 'bias testing pipeline', 'test my model for discrimination', 'automated bias discovery'.
Build evaluation benchmarks for audio language models using the BASS methodology: structured task taxonomies across structural segmentation, lyric transcription, musicological analysis, and artist collaboration. Trigger phrases: 'benchmark audio model', 'evaluate music understanding', 'music LLM evaluation', 'audio reasoning benchmark', 'test music AI capabilities', 'build music benchmark dataset'.
Audit and diagnose cultural bias artifacts in LLM-powered applications using the Machine Culture framework. Detects Cultural Reversal (language-culture misalignment), Service Persona Camouflage (RLHF-induced affective flattening), and superposition-driven inconsistencies. Use when: 'audit my prompts for cultural bias', 'why does my chatbot respond differently in Chinese vs English', 'detect persona camouflage in model outputs', 'cross-cultural prompt testing', 'diagnose RLHF flattening in my AI product', 'multilingual cultural consistency check'.