Papers on LLMs for IT operations and AIOps research
A curated collection of research on applying LLMs to IT Operations (AIOps) — log analysis, anomaly detection, incident management, root cause analysis, and automated remediation. Tracks how foundation models are transforming traditional rule-based operations tooling into intelligent, adaptive systems. Relevant for CS researchers at the intersection of systems, NLP, and operations.
LLM for AIOps
├── Log Analysis
│ ├── Log parsing (template extraction)
│ ├── Anomaly detection (from log sequences)
│ ├── Log summarization
│ └── Root cause from logs
├── Incident Management
│ ├── Incident triage and routing
│ ├── Severity classification
│ ├── Similar incident retrieval
│ └── Resolution recommendation
├── Root Cause Analysis
│ ├── Topology-aware diagnosis
│ ├── Multi-signal correlation
│ └── Causal inference
├── Monitoring & Alerting
│ ├── Metric anomaly detection
│ ├── Alert correlation
│ ├── Noise reduction
│ └── Capacity planning
└── Automated Remediation
  ├── Runbook generation
  ├── Script generation
  ├── Self-healing systems
  └── Change impact analysis
Production LLM monitoring dimensions:
QUALITY MONITORING
- Output quality scores: automated evaluation (LLM-as-judge, BERTScore, ROUGE)
- Hallucination rate: factual grounding checks against retrieval context
- Refusal rate: track over-cautious or under-cautious safety filters
- Latency percentiles: p50, p95, p99 for time-to-first-token and total generation
- Token usage: input/output token distributions, context window utilization
DRIFT DETECTION
- Input drift: embedding-space distribution shift (cosine distance, MMD)
- Output drift: topic/style distribution changes over time windows
- Performance drift: sliding-window accuracy on held-out evaluation sets
- Concept drift: monitor for domain vocabulary shifts in user queries
- Baseline comparison: periodically re-evaluate against golden test suites
OPERATIONAL HEALTH
- GPU utilization and memory pressure (per-device, per-replica)
- Request queue depth and timeout rates
- Cache hit rates (KV cache, semantic cache, prompt cache)
- Error rates by error category (OOM, context overflow, timeout, malformed output)
- Throughput: tokens/second per deployment, requests/minute
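
The input-drift check listed above (embedding-space shift measured by cosine distance) can be sketched as follows. This is a minimal illustration, not a reference implementation: it compares only the centroids of two embedding batches, the function names are hypothetical, and the `threshold` default of 0.1 is an assumed placeholder that would need tuning on real traffic.

```python
import math
from typing import Sequence


def centroid(vectors: Sequence[Sequence[float]]) -> list[float]:
    """Component-wise mean of a non-empty batch of equal-length embeddings."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """1 - cosine similarity; 0 means same direction, 1 means orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)


def input_drift_detected(
    reference: Sequence[Sequence[float]],
    current: Sequence[Sequence[float]],
    threshold: float = 0.1,  # assumed value, tune per deployment
) -> bool:
    """Flag drift when the current window's centroid moves away from the reference."""
    return cosine_distance(centroid(reference), centroid(current)) > threshold
```

A centroid comparison is cheap but only catches shifts in the mean direction of the embeddings; a distribution-level statistic such as MMD, also listed above, is more sensitive to changes in spread or shape.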
Designing valid A/B tests for LLM systems:
CHALLENGES UNIQUE TO LLMs
- High output variance: same prompt can produce different outputs
- Evaluation subjectivity: many tasks lack clear ground truth
- Latency-quality tradeoff: larger models often score higher on quality but respond more slowly
- Cost confound: better model may cost 10x more per query
RECOMMENDED APPROACH
1. Define metrics BEFORE experiment:
- Primary: task-specific quality (accuracy, user satisfaction, resolution rate)
- Secondary: latency, cost per query, token efficiency
- Guardrail: safety violations, hallucination rate
2. Traffic splitting strategy:
- User-level randomization (not request-level), so each user sees one variant consistently and per-user metrics are not contaminated across arms
- Minimum 1-2 weeks for stable estimates
- Stratify by user segment (power users vs. new users)
3. Evaluation methods:
- Automated scoring with LLM-as-judge (calibrated against human raters)
- Blind human evaluation on sampled outputs (inter-rater agreement, e.g. Cohen's kappa, > 0.7)
- Downstream business metrics (ticket resolution time, user retention)
4. Statistical rigor:
- Bootstrap confidence intervals for LLM quality scores
- Account for multiple comparisons when testing many variants
- Report effect sizes, not just p-values
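
The bootstrap confidence interval recommended under statistical rigor can be sketched with the standard library alone. This is a percentile bootstrap on the difference in mean quality score between two arms; the function name and defaults are illustrative assumptions, and each list element is assumed to be one user's aggregated score so that resampling matches user-level randomization.

```python
import random
from statistics import mean
from typing import Sequence


def bootstrap_diff_ci(
    control: Sequence[float],
    treatment: Sequence[float],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> tuple[float, float]:
    """Percentile bootstrap CI for mean(treatment) - mean(control)."""
    rng = random.Random(seed)  # fixed seed keeps the interval reproducible
    diffs = []
    for _ in range(n_resamples):
        # Resample each arm with replacement, keeping its original size.
        c = rng.choices(control, k=len(control))
        t = rng.choices(treatment, k=len(treatment))
        diffs.append(mean(t) - mean(c))
    diffs.sort()
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))
    return diffs[lo_idx], diffs[hi_idx]
```

The interval doubles as an effect-size report: if it excludes zero, the variant differs from control at roughly the chosen `alpha`, and its width shows how precisely the difference is estimated. When several variants are tested, `alpha` would need adjusting for multiple comparisons.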

| Tool | Focus | Key Capabilities |
|------|-------|-----------------|
| MLflow | End-to-end ML lifecycle | Experiment tracking, model registry, deployment, LLM evaluation |
| Weights & Biases | Experiment tracking + LLM monitoring | Traces, prompt versioning, evaluation tables, sweeps |
| LangSmith | LLM application observability | Trace visualization, prompt playground, dataset management, online evaluation |
| Comet ML | Experiment management | Model comparison, artifact tracking, LLM prompt tracking |

| Tool | Focus | Key Capabilities |
|------|-------|-----------------|
| vLLM | High-throughput serving | PagedAttention, continuous batching, tensor parallelism, speculative decoding |
| TGI (Text Generation Inference) | Production serving | Quantization, streaming, multi-LoRA, watermarking |
| Ollama | Local model running | Easy setup, model library, OpenAI-compatible API |
| TensorRT-LLM | NVIDIA-optimized inference | FP8 quantization, in-flight batching, custom kernels |
| SGLang | Structured generation serving | RadixAttention, constrained decoding, multi-modal support |

| Tool | Focus | Key Capabilities |
|------|-------|-----------------|
| LangChain / LangGraph | LLM application framework | Chains, agents, tool use, stateful multi-actor workflows |
| Haystack | NLP pipeline framework | RAG pipelines, document processing, evaluation |
| Prefect / Airflow | Workflow orchestration | DAG scheduling, retry logic, observability |
| Ray Serve | Distributed serving | Auto-scaling, multi-model composition, batch inference |

End-to-end LLMOps pipeline:
┌─────────────────────────────────────────────────────────────────┐
│ DATA PREPARATION │
│ Raw data → Cleaning → Annotation → Train/Eval split │
│ Tools: Label Studio, Argilla, Lilac, DVC │
└──────────────────────────┬──────────────────────────────────────┘
▼
┌─────────────────────────────────────────────────────────────────┐
│ MODEL DEVELOPMENT │
│ Base model selection → Fine-tuning (LoRA/QLoRA) → Evaluation │
│ Tools: Hugging Face Transformers, Axolotl, LLaMA-Factory │
│ Eval: lm-evaluation-harness, HELM, custom domain benchmarks │
└──────────────────────────┬──────────────────────────────────────┘
▼
┌─────────────────────────────────────────────────────────────────┐
│ MODEL REGISTRY & CI │
│ Version control → Automated testing → Approval gates │
│ Tools: MLflow Registry, W&B Model Registry, HF Hub │
│ Tests: regression suite, safety checks, latency benchmarks │
└──────────────────────────┬──────────────────────────────────────┘
▼
┌─────────────────────────────────────────────────────────────────┐
│ DEPLOYMENT │
│ Quantization → Containerization → Canary rollout → Full deploy │
│ Tools: vLLM, TGI, Docker, Kubernetes, Terraform │
│ Strategy: blue-green or canary with automatic rollback │
└──────────────────────────┬──────────────────────────────────────┘
▼
┌─────────────────────────────────────────────────────────────────┐
│ PRODUCTION MONITORING │
│ Quality monitoring → Drift detection → Alerting → Feedback │
│ Tools: LangSmith, W&B Weave, Prometheus + Grafana, PagerDuty │
│ Loop: degradation detected → trigger re-evaluation → retrain │
└─────────────────────────────────────────────────────────────────┘
Reducing model size and inference cost:
QUANTIZATION METHODS
- GPTQ: Post-training quantization, good quality at 4-bit, widely supported
- AWQ (Activation-aware Weight Quantization): Post-training, often reported to preserve quality better than GPTQ at 4-bit
- GGUF: llama.cpp file format, CPU-friendly, with variable bit-width quantization types (Q4_K_M, Q5_K_M, Q8_0)
- FP8: Native on NVIDIA H100/B200, minimal quality loss, up to ~2x throughput vs FP16
- AQLM: Additive quantization, aimed at extreme compression around 2-bit
PRACTICAL GUIDANCE
- 8-bit: negligible quality loss for most tasks (~0.1% accuracy drop)
- 4-bit: slight quality loss, acceptable for many production uses (~1-3% accuracy drop)
- 2-3 bit: noticeable degradation, use only when cost is critical
- Always evaluate on YOUR task after quantization (general benchmarks can be misleading)
- Combine quantization with speculative decoding for further speedup
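
The size reduction behind these bit-widths is simple arithmetic: weight memory is roughly parameter count times bits per weight, divided by 8 to get bytes. The helper below is a back-of-the-envelope sketch with a hypothetical name; it deliberately ignores quantization metadata (per-group scales and zero points push the effective bit-width slightly above the nominal one), the KV cache, and activations.

```python
def weight_memory_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GiB: params * bits / 8 bytes, then / 2**30."""
    return n_params * bits_per_weight / 8 / 2**30


if __name__ == "__main__":
    # Illustrative 8-billion-parameter model at several nominal bit-widths.
    for bits in (16, 8, 4, 2):
        print(f"{bits:>2}-bit: {weight_memory_gib(8e9, bits):.1f} GiB")
```

Going from 16-bit to 4-bit cuts weight storage by a factor of four under this estimate, which is why 4-bit is a common target when a model must fit on a single smaller GPU.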
Multi-layer caching for LLM systems:
EXACT MATCH CACHE
- Hash the full prompt, return cached response for identical queries
- Hit rate: typically 5-15% for general-purpose, 30-60% for structured queries
- Tools: Redis, DragonflyDB, in-memory LRU
SEMANTIC CACHE
- Embed the prompt, return cached response for semantically similar queries
- Similarity threshold: 0.95+ cosine similarity (tune per use case)
- Tools: GPTCache, Redis with vector search, Qdrant
- Risk: semantically similar prompts may require different answers
KV CACHE OPTIMIZATION
- PagedAttention (vLLM): eliminates memory waste from pre-allocated KV cache
- Prefix caching: reuse KV cache for shared system prompts across requests
- Quantized KV cache: FP8 or INT8 KV values (H100+, ~2x context capacity)
PROMPT CACHING (API providers)
- Anthropic prompt caching: cache static prefix, pay reduced rate for cached tokens
- OpenAI cached context: automatic for repeated prefixes
- Design prompts with static prefix (system prompt, examples) + dynamic suffix (user query)
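
The exact-match layer above can be sketched as an in-memory LRU keyed by a hash of the prompt. This is a minimal single-process illustration under stated assumptions: the class name, the `generate` callable, and the choice to include the model name in the key are all hypothetical, and a shared store such as Redis would replace the `OrderedDict` in a multi-replica deployment.

```python
import hashlib
from collections import OrderedDict
from typing import Callable


class ExactMatchCache:
    """In-memory LRU cache keyed by a SHA-256 hash of (model, prompt)."""

    def __init__(self, max_entries: int = 1024) -> None:
        self._store = OrderedDict()  # key -> cached response, oldest first
        self.max_entries = max_entries
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(prompt: str, model: str) -> str:
        # Include the model name so different models never share an entry.
        return hashlib.sha256(f"{model}\x00{prompt}".encode("utf-8")).hexdigest()

    def get_or_generate(
        self, prompt: str, model: str, generate: Callable[[str], str]
    ) -> str:
        key = self._key(prompt, model)
        if key in self._store:
            self._store.move_to_end(key)  # mark as most recently used
            self.hits += 1
            return self._store[key]
        self.misses += 1
        response = generate(prompt)  # only called on a cache miss
        self._store[key] = response
        if len(self._store) > self.max_entries:
            self._store.popitem(last=False)  # evict least recently used
        return response
```

Tracking `hits` and `misses` makes the hit rate directly observable, so the ranges quoted above can be checked against real traffic. Caching is only safe for deterministic or low-temperature use cases where returning a stored response for a repeated prompt is acceptable.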
Cost-quality optimization through model routing:
TIERED MODEL ROUTING
- Simple queries → small/fast model (e.g., GPT-4o-mini, Claude Haiku, Llama-8B)
- Complex queries → large/capable model (e.g., GPT-4o, Claude Sonnet, Llama-70B)
- Critical queries → frontier model (e.g., o3, Claude Opus)
ROUTING STRATEGIES
1. Classifier-based: Train a small classifier on query complexity
- Features: query length, vocabulary complexity, domain signals
- Labels: which model tier produces acceptable quality
- Cost: classifier inference is negligible (<1ms, <$0.001)
2. Cascade (try-small-first):
- Route to cheapest model first
- Check output quality with a verifier
- Escalate to larger model if quality is insufficient
- Effective when >50% of queries are simple
3. Task-based routing:
- Summarization, translation → mid-tier model
- Code generation, math reasoning → high-tier model
- Classification, extraction → small model or fine-tuned specialist
EXPECTED SAVINGS
- Typically a 40-70% cost reduction compared with routing everything to the best model
- Quality degradation: <5% when routing thresholds are properly calibrated
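
The cascade strategy above (try the cheapest model, verify, escalate) can be sketched as a loop over tiers ordered by cost. Everything named here is hypothetical: `Tier`, `cascade`, and the `is_acceptable` verifier stand in for real model clients and a real quality check, and the per-query costs are placeholders.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class Tier:
    name: str
    cost_per_query: float            # placeholder cost in arbitrary units
    generate: Callable[[str], str]   # stand-in for a real model client


def cascade(
    query: str,
    tiers: Sequence[Tier],
    is_acceptable: Callable[[str, str], bool],
) -> tuple[str, str, float]:
    """Try tiers from cheapest to most expensive; stop at the first accepted answer.

    Returns (tier name, answer, total cost incurred across all attempts).
    """
    if not tiers:
        raise ValueError("at least one tier is required")
    total_cost = 0.0
    answer = ""
    for tier in tiers:
        answer = tier.generate(query)
        total_cost += tier.cost_per_query  # failed attempts still cost money
        if is_acceptable(query, answer):
            return tier.name, answer, total_cost
    # No tier passed verification: fall back to the last (largest) tier's answer.
    return tiers[-1].name, answer, total_cost
```

Because an escalated query pays for every tier it passed through, the cascade only saves money when a large share of traffic is resolved by the cheaper tiers, which is consistent with the >50% guideline above. The verifier's own cost and latency are not modeled in this sketch.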

| Paper | Year | Focus |
|-------|------|-------|
| LogPPT | 2023 | Few-shot log parsing with prompt tuning |
| OpsEval | 2024 | Benchmark for evaluating LLMs in AIOps |
| D-Bot | 2024 | LLM-based database diagnosis |
| RCAgent | 2024 | Agent for root cause analysis |
| LogAgent | 2024 | Autonomous log analysis agent |
| AIOpsLab | 2024 | Holistic benchmark suite for AIOps agents |
| MonitorAssistant | 2024 | LLM-based alert correlation and noise reduction |
| LLM4Ops Survey | 2024 | Comprehensive survey of LLMs for IT operations |
