skills/devops-gym-benchmarking-ai-agents/SKILL.md
Apply the DevOps-Gym methodology to systematically tackle full-cycle DevOps tasks: build/configuration repair, runtime monitoring and anomaly detection, issue resolving via code patches, and regression test generation for Java and Go projects. Trigger phrases: 'fix this build failure', 'diagnose this runtime anomaly', 'generate regression tests for this bug', 'resolve this issue in Java/Go', 'debug this CI pipeline', 'monitor this running service for anomalies'.
npx skillsauth add ndpvt-web/arxiv-claude-skills devops-gym-benchmarking-ai-agents
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
This skill equips Claude to handle the four core DevOps workflow categories identified in the DevOps-Gym benchmark (arXiv:2601.20882): build and configuration, runtime monitoring, issue resolving, and test generation. Rather than treating these as isolated coding tasks, the methodology emphasizes sequential decision-making across the full DevOps cycle -- analyzing large-scale Java/Go projects, understanding dynamic runtime behavior, leveraging domain-specific build and monitoring tools, and producing verifiable outputs (patches, diagnostics, tests) that are validated against real execution.
The DevOps-Gym benchmark reveals that AI agents fail at DevOps tasks for three specific, addressable reasons: (1) toolchain knowledge gaps -- agents don't understand the internal mechanics of Maven, Gradle, goreleaser, and Go modules well enough to fix configuration issues; (2) premature convergence -- agents stop after partial fixes instead of running iterative fix-run-verify loops; (3) cross-language capability gaps -- performance drops dramatically from Python to Java/Go due to compiled-language complexity (multi-stage compilation, linking, type systems). The benchmark shows Claude Code achieving 58% on build tasks but only 14-24% on monitoring and test generation, meaning these harder categories require deliberate strategies.
The actionable methodology is structured verification: for every DevOps category, the agent must produce output in a specific format (diff patch, structured diagnostic, test file) and verify it against execution. Build patches must compile and pass tests. Monitoring diagnoses must cite quantitative evidence (memory growth rates, process IDs). Issue patches must pass fail-to-pass tests without regressions. Generated tests must fail on buggy code and pass on patched code. The key insight is that agents that enforce iterative fix-run-verify loops outperform those that attempt single-shot solutions.
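The fail-to-pass rule for issue patches can be expressed as a small check over test outcomes. The following is a minimal sketch, not the benchmark's own harness: the `TestResult` and `Verdict` types, the `evaluatePatch` function, and the test name `TestParseBody_ValidJSON` are hypothetical names introduced for illustration.

```go
package main

import "fmt"

// TestResult records one test's outcome before and after a patch.
// The type and its fields are hypothetical names for this sketch.
type TestResult struct {
	Name         string
	PassedBefore bool
	PassedAfter  bool
}

// Verdict groups tests by how the patch changed their outcome.
type Verdict struct {
	FailToPass   []string // failed before, pass after: the intended fix
	Regressions  []string // passed before, fail after: new breakage
	StillFailing []string // failed before and after: fix incomplete
}

// evaluatePatch classifies every test result into a Verdict bucket.
func evaluatePatch(results []TestResult) Verdict {
	var v Verdict
	for _, r := range results {
		switch {
		case !r.PassedBefore && r.PassedAfter:
			v.FailToPass = append(v.FailToPass, r.Name)
		case r.PassedBefore && !r.PassedAfter:
			v.Regressions = append(v.Regressions, r.Name)
		case !r.PassedBefore && !r.PassedAfter:
			v.StillFailing = append(v.StillFailing, r.Name)
		}
	}
	return v
}

// Acceptable is true only when something was fixed and nothing is broken.
func (v Verdict) Acceptable() bool {
	return len(v.FailToPass) > 0 && len(v.Regressions) == 0 && len(v.StillFailing) == 0
}

func main() {
	v := evaluatePatch([]TestResult{
		{Name: "TestParseBody_ZeroContentLengthWithBody", PassedBefore: false, PassedAfter: true},
		{Name: "TestParseBody_ValidJSON", PassedBefore: true, PassedAfter: true},
	})
	fmt.Println("acceptable:", v.Acceptable(), "fixed:", v.FailToPass)
}
```

The same shape applies to generated tests with the roles reversed: a regression test is useful only if it lands in the fail-to-pass bucket when run against the buggy and the patched code.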
For monitoring tasks specifically, the benchmark identifies four failure modes that agents must avoid: inadequate monitoring methodology (37% of failures) -- solved by systematic multi-tool sampling over time; premature conclusions (26%) -- solved by requiring temporal evidence across multiple observation windows; insufficient temporal granularity (11%) -- solved by collecting data at regular intervals; and interpretation failures (26%) -- solved by comparing against baselines before diagnosing anomalies.
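One way to guard against premature conclusions is to refuse to diagnose until several samples exist and the trend exceeds the baseline. The sketch below assumes a hypothetical `Sample` type and a `diagnose` function; the three-sample minimum and the 90% CPU threshold are illustrative choices, not values prescribed by the benchmark.

```go
package main

import "fmt"

// Sample is one observation of a process. Field names are hypothetical.
type Sample struct {
	ElapsedMin float64 // minutes since the first observation
	RSSMB      float64 // resident set size in MB
	CPUPct     float64 // CPU utilization in percent
}

// minSamples is the smallest number of observations accepted as a trend.
const minSamples = 3

// monotonicGrowth reports whether RSS strictly increases across all samples.
func monotonicGrowth(s []Sample) bool {
	if len(s) < minSamples {
		return false
	}
	for i := 1; i < len(s); i++ {
		if s[i].RSSMB <= s[i-1].RSSMB {
			return false
		}
	}
	return true
}

// sustainedCPU reports whether every sample exceeds the threshold.
func sustainedCPU(s []Sample, threshold float64) bool {
	if len(s) < minSamples {
		return false
	}
	for _, x := range s {
		if x.CPUPct <= threshold {
			return false
		}
	}
	return true
}

// diagnose compares the observed trend against a baseline before labeling it.
func diagnose(baselineRSSMB float64, s []Sample) string {
	switch {
	case monotonicGrowth(s) && s[len(s)-1].RSSMB > baselineRSSMB:
		return "suspected_memory_leak"
	case sustainedCPU(s, 90):
		return "cpu_saturation"
	default:
		return "no_anomaly_detected"
	}
}

func main() {
	samples := []Sample{
		{ElapsedMin: 0, RSSMB: 512, CPUPct: 20},
		{ElapsedMin: 10, RSSMB: 810, CPUPct: 25},
		{ElapsedMin: 20, RSSMB: 1120, CPUPct: 22},
	}
	fmt.Println(diagnose(512, samples))
}
```

With fewer than three samples both checks return false, so a single spike never produces a diagnosis on its own.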
Determine which of the four categories applies: build/config (compilation failures, dependency errors, toolchain issues), monitoring (runtime anomalies, performance degradation), issue resolving (bug description to code patch), or test generation (bug description to regression test). This determines the tool set, output format, and verification strategy.
For build tasks: read pom.xml, build.gradle, go.mod, CI config files, and recent build logs. For monitoring: use top, free -m, ps aux, netstat, iostat to capture baseline system state. For issue/test tasks: read the bug description, identify the affected module, and map the relevant source files and existing test suites.
For build failures: parse error messages to distinguish dependency conflicts, version mismatches, missing plugins, and toolchain incompatibilities. For monitoring: collect system metrics at 3+ time intervals to establish trends (e.g., monotonically increasing memory = leak, sustained >90% CPU = saturation). For issue resolving: trace the bug description to specific code paths using grep, call graph analysis, and test failure output.
Produce the smallest change that addresses the root cause. For build config: edit only the specific dependency version, plugin configuration, or build script line. For code patches: generate a unified diff that touches only the buggy logic. Avoid refactoring or unrelated improvements -- the DevOps-Gym evaluation penalizes patches that introduce new test failures.
Iterating after every change is the critical differentiator. After applying each change: (a) run the build/test/monitoring check, (b) analyze the output for remaining failures, (c) apply incremental fixes. Do NOT stop after the first attempt. The benchmark shows agents that iterate achieve significantly better results than single-shot approaches.
For build tasks, mvn clean install, gradle build, or go build must complete with exit code 0, and any associated test suite must pass.
Java and Go introduce compilation stages absent in Python. For Java: check that all imports resolve, generics are type-safe, and the build tool's dependency resolution is consistent. For Go: verify module paths in go.mod, ensure interface implementations are complete, and check that cross-package references compile.
Produce output in the expected format: unified diff for patches, structured diagnostic for monitoring, or test file for test generation. Include a brief explanation of what was wrong and why the fix is correct, so the user can verify the reasoning.
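The fix-run-verify loop from the workflow above can be sketched as a bounded retry around a verification command. This is an illustration under stated assumptions: `fixRunVerify` and `runCheck` are hypothetical helpers, the `fix` callback stands in for whatever edit the agent applies, and `main` uses a simulated check so the example runs without Maven or Gradle installed.

```go
package main

import (
	"fmt"
	"os/exec"
)

// runCheck executes a verification command and reports its combined output
// and whether it exited with code 0. In real use a check callback would wrap
// a call such as runCheck("mvn", "clean", "test") or runCheck("go", "build", "./...").
func runCheck(name string, args ...string) (string, bool) {
	out, err := exec.Command(name, args...).CombinedOutput()
	return string(out), err == nil
}

// fixRunVerify alternates check and fix until the check passes, a fix
// errors out, or maxFixes fixes have been applied. It returns the number
// of fixes applied and whether the final check passed.
func fixRunVerify(maxFixes int, check func() (string, bool), fix func(output string) error) (int, bool) {
	fixes := 0
	for {
		output, passed := check()
		if passed {
			return fixes, true
		}
		if fixes == maxFixes {
			return fixes, false
		}
		if err := fix(output); err != nil {
			return fixes, false
		}
		fixes++
	}
}

func main() {
	// Simulated project with two outstanding build errors.
	remaining := 2
	check := func() (string, bool) {
		return fmt.Sprintf("%d errors remaining", remaining), remaining == 0
	}
	fix := func(output string) error {
		remaining-- // each fix resolves one error in this simulation
		return nil
	}
	fixes, ok := fixRunVerify(5, check, fix)
	fmt.Println("fixes applied:", fixes, "verified:", ok)
}
```

The loop always ends on a check rather than a fix, so a result is never reported as verified without a final run of the build or test command.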
Example 1: Build Configuration Repair (Maven Dependency Conflict)
User: "My Java project throws NoSuchMethodError at runtime after upgrading Spring Boot to 3.2. Compilation succeeds, but the tests fail."
Approach:
1. Read pom.xml to identify Spring Boot version and all transitive dependencies
2. Run mvn dependency:tree to find conflicting library versions pulled in by different dependencies
3. Identify that an older jackson-databind is being pulled in transitively, conflicting with Spring Boot 3.2's expected version
4. Add a <dependencyManagement> entry pinning jackson-databind to the version compatible with Spring Boot 3.2
5. Run mvn clean test to verify all tests pass
Output:
<!-- pom.xml patch -->
<!-- The version below is illustrative; pin to the Jackson version your Spring Boot 3.2.x release manages. -->
<dependencyManagement>
  <dependencies>
    <dependency>
      <groupId>com.fasterxml.jackson.core</groupId>
      <artifactId>jackson-databind</artifactId>
      <version>2.16.1</version>
    </dependency>
  </dependencies>
</dependencyManagement>
Verification: mvn clean test exits 0, all 247 tests pass.
Example 2: Runtime Monitoring -- Memory Leak Detection
User: "Our Go service is getting OOM-killed in production after ~2 hours. Diagnose what's happening."
Approach:
1. Run free -m at t=0 to establish memory baseline
2. Run ps aux --sort=-rss | head -20 to identify the top memory consumers
3. Run curl 'localhost:6060/debug/pprof/goroutine?debug=1' if pprof is exposed
Output:
memory_leak: process=myservice pid=4821 rss_growth=30MB/min baseline=512MB current=1847MB goroutine_count=increasing(2340->8901)
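A diagnostic line in this shape can be generated from raw measurements so that the growth rate is computed rather than estimated by eye. The sketch below assumes a hypothetical `MemoryDiagnostic` type; the 44.5-minute observation window is an assumed value chosen so the arithmetic reproduces the 30MB/min figure above, and it does not come from the original example.

```go
package main

import "fmt"

// MemoryDiagnostic holds the evidence behind a memory-leak diagnosis.
// The type and field names are hypothetical.
type MemoryDiagnostic struct {
	Process         string
	PID             int
	BaselineMB      float64
	CurrentMB       float64
	ElapsedMin      float64 // length of the observation window in minutes
	GoroutinesStart int
	GoroutinesEnd   int
}

// GrowthMBPerMin returns the average RSS growth rate over the window.
func (d MemoryDiagnostic) GrowthMBPerMin() float64 {
	if d.ElapsedMin <= 0 {
		return 0
	}
	return (d.CurrentMB - d.BaselineMB) / d.ElapsedMin
}

// String renders the diagnosis as a single key=value line.
func (d MemoryDiagnostic) String() string {
	trend := "stable"
	switch {
	case d.GoroutinesEnd > d.GoroutinesStart:
		trend = "increasing"
	case d.GoroutinesEnd < d.GoroutinesStart:
		trend = "decreasing"
	}
	return fmt.Sprintf(
		"memory_leak: process=%s pid=%d rss_growth=%.0fMB/min baseline=%.0fMB current=%.0fMB goroutine_count=%s(%d->%d)",
		d.Process, d.PID, d.GrowthMBPerMin(), d.BaselineMB, d.CurrentMB,
		trend, d.GoroutinesStart, d.GoroutinesEnd,
	)
}

func main() {
	d := MemoryDiagnostic{
		Process:         "myservice",
		PID:             4821,
		BaselineMB:      512,
		CurrentMB:       1847,
		ElapsedMin:      44.5, // assumed window: (1847-512)/44.5 = 30 MB/min
		GoroutinesStart: 2340,
		GoroutinesEnd:   8901,
	}
	fmt.Println(d)
}
```

Keeping the baseline, current value, and window in the record means the reported rate can be rechecked by anyone reading the diagnosis.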
Example 3: Regression Test Generation from Bug Description
User: "Bug report says: 'When a Go HTTP handler receives a request with a Content-Length header of 0 but a non-empty body, the server panics with nil pointer dereference in parseBody().' Generate a regression test."
Approach:
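1. Locate parseBody() in the source
2. Identify which nil value causes the panic (likely an unchecked req.Body when Content-Length is 0)
3. Construct an http.Request with Content-Length: 0 and a non-empty bytes.Buffer body
4. Assert that the handler does not panic (using recover() or httptest)
5. Follow the project's existing test conventions (testing package, file naming)
Output: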
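import (
	"bytes"
	"net/http"
	"net/http/httptest"
	"testing"
)

// handler is assumed to be the project's http.Handler that calls parseBody().
func TestParseBody_ZeroContentLengthWithBody(t *testing.T) {
	body := bytes.NewBufferString(`{"key":"value"}`)
	req := httptest.NewRequest("POST", "/api/data", body)
	// NewRequest derives ContentLength from the buffer, so override the field
	// as well as the header to reproduce the mismatch from the bug report.
	req.ContentLength = 0
	req.Header.Set("Content-Length", "0")
	recorder := httptest.NewRecorder()
	defer func() {
		if r := recover(); r != nil {
			t.Fatalf("parseBody panicked with Content-Length 0 and non-empty body: %v", r)
		}
	}()
	handler.ServeHTTP(recorder, req)
	if recorder.Code == http.StatusInternalServerError {
		t.Error("expected successful parsing, got 500")
	}
}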
Verification: Test fails on buggy code (panic), passes after nil-check fix in parseBody().
Run mvn dependency:tree or gradle dependencies to understand transitive dependency graphs before editing build files. 37% of build failures stem from domain-specific knowledge gaps about build tool internals.
Build tool not found or wrong version: Check which mvn, java -version, go version first. Install or configure the correct toolchain before attempting fixes.
Monitoring context exhaustion: System monitoring can generate enormous output. Limit top and ps to targeted queries (specific PIDs, specific metrics). Avoid dumping full system state repeatedly -- summarize trends instead of storing raw output.
Flaky tests during verification: If a test passes/fails inconsistently, run it 3 times. If it's flaky independent of your change, note this to the user and focus on the fail-to-pass tests specific to the bug.
Patch applies but introduces new failures: Revert, re-read the failing tests to understand what invariant was violated, and produce a more targeted fix. Never submit a patch that trades one failure for another.
Go module resolution failures: Run go mod tidy after any dependency change. Check that go.sum is updated. For vendored projects, run go mod vendor as well.
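The flaky-test guidance above (run the test three times before drawing a conclusion) can be sketched as a small classifier. `classifyRuns` is a hypothetical helper; in real use the `run` callback would shell out to the project's test runner, for example `go test -run <TestName> -count=1`, where `-count=1` bypasses Go's test result cache. The example below uses simulated outcomes so it runs anywhere.

```go
package main

import "fmt"

// classifyRuns executes run the given number of times and labels the test
// as consistently passing, consistently failing, or flaky.
func classifyRuns(run func() bool, attempts int) string {
	if attempts <= 0 {
		return "not_run"
	}
	passes := 0
	for i := 0; i < attempts; i++ {
		if run() {
			passes++
		}
	}
	switch passes {
	case attempts:
		return "stable_pass"
	case 0:
		return "stable_fail"
	default:
		return "flaky"
	}
}

// replay returns a run function that yields the given outcomes in order.
// It exists only to simulate a test runner in this sketch.
func replay(outcomes []bool) func() bool {
	i := 0
	return func() bool {
		result := outcomes[i%len(outcomes)]
		i++
		return result
	}
}

func main() {
	fmt.Println(classifyRuns(replay([]bool{true, false, true}), 3))
	fmt.Println(classifyRuns(replay([]bool{false, false, false}), 3))
}
```

A "flaky" result on code you have not touched is the signal to report the test to the user and to judge the patch on the fail-to-pass tests instead.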
DevOps-Gym: Benchmarking AI Agents in Software DevOps Cycle -- Tang et al., 2026. Focus on Section 3 (task definitions and evaluation metrics), Table 1 (agent performance by category), and Section 5 (error analysis and failure modes) for the specific strategies that differentiate successful from unsuccessful agent behaviors.