
Performs authorized fuzz testing of binary formats and network protocols to uncover parser vulnerabilities, memory safety defects, and denial-of-service conditions. Use when assessing protocol handlers, file parsers, and service robustness against malformed inputs.
Performs authorized fuzzing of web applications and APIs to discover input validation failures, parser bugs, and stability issues. Use when testing HTTP endpoints, request parameters, payload handling, and error behavior under malformed or unexpected inputs.
Guides controlled exploitation of validated vulnerabilities to measure real-world impact. Use when the user requests proof-of-concept validation, privilege escalation testing, or attack path confirmation in an authorized environment.
Demonstrates Living-off-the-Land (LotL) techniques using native OS tools to simulate realistic threat actor behavior during authorized penetration tests. Use when proving attack feasibility without custom malware, testing detection coverage, and validating what a real adversary could achieve with only built-in system capabilities.
Creates penetration test deliverables for executive and technical audiences, including prioritized findings and remediation plans. Use when drafting, structuring, or finalizing pen test reports from collected evidence.
Evaluates whether an attacker could retain foothold and move laterally after initial compromise, within strict authorization limits. Use when testing persistence, session resilience, and detection/response effectiveness during a pen test.
Creates Nuclei YAML templates for vulnerability detection across HTTP, DNS, TCP, SSL, and other protocols. Use when converting a confirmed vulnerability, misconfiguration, or exposure into a reusable automated check — for example, turning a manual finding into a detection rule, writing a CVE check, or codifying a technology fingerprint.
Performs authorized post-exploitation activities to assess impact, lateral movement paths, credential exposure, and detection gaps after initial compromise. Use when a foothold has been validated and the test requires controlled impact expansion analysis.
Defines penetration test scope and performs authorized reconnaissance using passive and active methods. Use when planning a test engagement, collecting target intelligence, building asset inventories, or preparing recon findings.
Performs authorized security scanning using static, dynamic, and vulnerability-focused methods. Use when mapping exposed services, profiling application behavior, and identifying known weaknesses for validation.
Performs authorized web application and API penetration testing with focus on OWASP-style risks and business logic flaws. Use when assessing websites, web APIs, authentication flows, session handling, and input validation.
Produces penetration test reports with executive summary, technical findings, and remediation guidance. Use when consolidating test evidence, prioritizing risk, and preparing stakeholder-ready deliverables.
Performs authorized security assessment of embedded and IoT devices across hardware, firmware, interfaces, and update mechanisms. Use when testing device boot flows, debug interfaces, firmware integrity, and local/network attack surfaces.
Finds and removes redundant tests — tests that cover the same code, kill the same mutants, or assert the same behavior — to shrink suite runtime without losing coverage. Use when the test suite is slow, when tests have accumulated over years of copy-paste, or when CI costs are too high.
Orders tests so failures surface earliest — runs tests covering changed code first, historically flaky/failing tests early, and slow low-value tests last. Use when the suite is too slow to run in full on every change, when CI feedback takes too long, or when deciding what to run in a smoke-test tier.
Converts a model checker counterexample trace into an executable test case in the source language, so the bug found in the model is reproducible (and regression-guarded) in the real code. Use when TLC/NuSMV/Spin finds a violation and you want a failing test before writing the fix.
Infers likely loop invariants and function contracts by observing execution traces, synthesizing candidates, and checking them inductively. Use when a verifier rejects a loop because the invariant is missing or too weak, when a Daikon-style tool is needed, or before translating code to a verification language.
Uses a model checker's counterexample trace to localize the fault in the model, propose a fix, and propagate that fix back to the source code. Use when a model checker (TLC, NuSMV, Spin) finds a violation and you need to turn the trace into a code change, not just understand it.
Extracts a TLA+ specification from concurrent or distributed code, modeling the state machine, actions, and fairness conditions for model checking with TLC. Use when verifying concurrency properties of production code, when designing a protocol and wanting to check it before implementation, or when the user has a race condition and needs to prove the fix.
Translates natural-language requirements into TLA+ properties — invariants for safety, temporal formulas for liveness — checkable with TLC. Use when writing the PROPERTY and INVARIANT sections of a TLA+ spec, when formalizing acceptance criteria, or when the user has a requirement and a model but no property.
Reduces a TLA+ model so TLC can actually check it — shrinks constants, adds state constraints, abstracts data, or applies symmetry — when the state space is too large to enumerate. Use when TLC runs out of memory, when checking takes hours, or when a spec works at N=2 and you need confidence at larger scale.
Generates abstract invariants using domain abstraction — intervals, octagons, polyhedra, sign domains — to find invariants that concrete reasoning misses. Use when standard invariant inference fails, when the invariant involves relationships between multiple variables, or when verifying numerical code.
Detects ambiguity in natural-language requirements — weak words, dangling references, underspecified quantities, conflicting interpretations — before they become implementation bugs. Use when reviewing requirements, when a spec uses words like "appropriate" or "fast", or when two engineers read the same requirement and built different things.
Verifies that a refactoring or transformation preserved observable behavior by comparing before and after execution, differential testing, or I/O capture. Use after a refactoring, after automated code transformation, before merging a structural PR, or whenever the claim is that two code versions do the same thing.
Creates minimal, reproducible test cases from bug reports to confirm the defect before and after a fix. Use when a bug is reported without a failing test, when the user needs a regression test for a fix, or when the user asks to reproduce a bug as a test.
Assists migrating a build or CI pipeline from one system to another — Jenkins to GitHub Actions, Travis to GitLab CI, Makefile to Bazel — preserving semantics and surfacing untranslatable constructs. Use when switching CI providers, when modernizing a legacy build, or when the user pastes a Jenkinsfile and asks for the GitHub Actions equivalent.
Generates deployment pipelines with environment promotion, approval gates, and rollback triggers based on target infrastructure. Use when wiring automated deployments from CI to staging/production, when the user asks for a release pipeline, or when adding promotion gates to an existing deploy workflow.
Generates a structured CHANGELOG.md from VCS history and PR/issue references, categorized by change type. Use when cutting a release, when the user asks to update CHANGELOG.md, or when backfilling a changelog from git history.
Generates CI pipeline configs by analyzing a repo's structure, language, and build needs — GitHub Actions, GitLab CI, or other platforms. Use when bootstrapping CI for a new repo, when porting from one CI to another, when the user asks for a pipeline that builds and tests their project, or when wiring in security gates.
Identifies recurring structural patterns in a codebase — idioms, copy-paste clones, homegrown abstractions — and characterizes each as a reusable template. Use when learning a codebase's conventions, when hunting for copy-paste that should be a function, or when documenting how this team does things.
Identifies code smells — structural patterns that correlate with maintainability problems — and explains why each matters in context. Use when reviewing a PR for structural quality, when the user asks what's wrong with a piece of code that isn't buggy, or when prioritizing refactoring targets.
Produces natural-language summaries of what code does at the function, class, module, or subsystem level, with length and abstraction scaled to the scope. Explains purpose, side effects, and non-obvious behavior rather than restating syntax. Use when onboarding to unfamiliar code, when the user asks what something does, when generating docstrings or architecture notes, or when preparing a handoff document.
Translates a single function or small code unit between programming languages, mapping idioms and preserving observable behavior. Use when porting one function, when the user pastes code and asks for it in another language, or as the per-unit primitive for larger migrations.
Identifies natural component boundaries inside a monolith by clustering the dependency graph, finding the cuts with minimum coupling. Use when planning to modularize or extract microservices, when deciding what can be deployed independently, or when the user asks where the seams in this codebase are.
Generates configuration files for services and tools (app config, logging config, linter config, database config) from a brief description of desired behavior, matching the target format's idioms. Use when bootstrapping a new service, when the user asks for a config file for a specific tool, or when translating config intent between formats.
Generates hardened, multi-stage Dockerfiles with non-root users, minimal base images, and a .dockerignore, after auto-detecting the application stack. Use when containerizing an application for the first time, when the user asks for a Dockerfile, when migrating from a VM deployment, or when an existing Dockerfile runs as root, uses a fat base image, or leaks build tooling into the runtime layer.
Interprets and explains counterexamples produced by model checkers or property-based testing tools to make them actionable. Use when TLC, NuSMV, CBMC, or a property-based test emits a counterexample the user doesn't understand, when a trace is too long to read, or when mapping a model-level trace back to source code.
Raises test coverage by identifying uncovered code regions, ranking them by risk, and generating targeted tests that hit them — prioritizing branches and conditions over raw line count. Use when coverage is below target, when untested code is blocking a release, or when deciding which tests to write next.
Translates C++ functions into Dafny for formal verification, modeling pointers, fixed-width integers, and manual memory as Dafny heap objects and bitvectors. Use when verifying a C++ algorithm, when proving absence of overflow or out-of-bounds access, or when building a verified reference for safety-critical C++ code.
Finds and safely removes code that is never executed — unreachable branches, uncalled functions, unused classes, dead feature flags. Use when cleaning up after a feature removal, when the user suspects the codebase has accumulated cruft, or when reducing build/bundle size.
Diagnoses and resolves package dependency conflicts — version mismatches, diamond dependencies, cycles — across npm, pip, Maven, Cargo, and similar ecosystems. Use when install fails with a resolution error, when two packages require incompatible versions of a third, or when upgrading one dependency breaks another.
Dispatch skill — routes a formal specification request to the right concrete generator based on what's being specified and what needs to be proven. Use when the user asks to formally specify something without naming a target formalism, or when unsure which verification tool fits the problem.
Generates JUnit regression tests that lock in current behavior before a refactor, capturing observed outputs as assertions so that any behavioral change trips a test. Use before large refactors, when inheriting untested legacy Java, or when the spec is "whatever it does now."
Translates an entire module or package between languages, handling imports, file layout, visibility, and cross-function dependencies that single-function translation misses. Use when porting a library, when a migration spans multiple files, or when the user hands you a directory and a target language.
Uses mutation testing to find weak assertions and missing tests — injects small bugs and checks if the suite catches them, then generates tests targeting the surviving mutants. Use when coverage is high but bugs still ship, when auditing test quality, or when deciding if the suite is good enough.
Recommends the specific code change to remediate a detected vulnerability by dispatching on CWE to the matching Project CodeGuard rule's prescribed fix pattern. Use after a finding has been confirmed and located, when the user asks how to fix a vulnerability, or when generating remediation PRs.
Translates pseudocode into idiomatic Java, inferring types, choosing collection classes, and handling exceptions per Java conventions. Use when implementing an algorithm from a paper or spec, when the user hands you pseudocode and wants Java, or when realizing a verified-pseudocode artifact.
Generates pytest regression tests that capture current behavior as snapshot assertions, using Python's dynamism for low-friction recording. Use before refactoring untested Python, when the behavioral spec is "whatever it does now," or when migrating Python 2→3 or between framework versions.
Translates Python functions into Dafny, adding types, pre/postconditions, and loop invariants sufficient for Dafny to verify. Use when formally verifying a Python algorithm, when the user wants machine-checked correctness for a critical function, or when building a verified reference implementation.
Translates Python into Lean 4 for interactive theorem proving, handling dynamic types and duck typing by specializing to the concrete types actually used. Use when proving correctness of a Python algorithm beyond what testing can establish, or when building a verified reference for numerical or combinatorial Python code.
Traces regressions to the specific commit, change, or code path that introduced the behavioral breakage. Use when a previously passing test or feature now fails, when the user asks what change caused a regression, or when bisecting a regression across commits.
Checks whether an implementation covers a set of requirements by tracing each requirement to code, tests, or both — and flagging gaps where a requirement has no evidence of implementation. Use when auditing for compliance, when answering "is this spec implemented", or before claiming a standard is supported.
Rewrites vague or incomplete requirements into precise, testable statements — filling in quantities, actors, conditions, and error behavior while preserving intent. Use after ambiguity-detector flags problems, when a requirement can't be turned into a test, or when engineers keep asking the same clarification questions.
Alias for requirement-summarizer. Produces a structured summary of a requirements document — the key obligations, grouped by actor and concern, with the MUST/SHOULD/MAY breakdown. Use when onboarding to a large spec, when deciding what to implement first, or when the user asks what a standard actually requires.
Advises on rollback strategies by analyzing what a deploy changes — recommending revert, roll-forward, feature-flag kill, or data repair depending on reversibility. Use during an incident when a deploy went bad, when designing a deploy pipeline and the user asks how to make it reversible, or when a migration needs an undo plan.
Translates cryptic runtime error messages and stack traces into understandable explanations, pointing to the concrete line at fault and the most likely fix. Use when a user pastes an error they don't understand, when a stack trace is deep and the user doesn't know where to start, or when an error message misleads about the real cause.
Extends classic SZZ with semantic code understanding to reduce false positives and improve accuracy of bug-introducing commit identification. Use after classic SZZ has produced candidates, when SZZ precision is too low for the task, or when the user needs high-confidence bug-introduction data.
Translates specifications into temporal logic formulas (LTL, CTL, or TLA) by matching the specification's shape to the right logic and operators. Use when formalizing requirements for any model checker, when choosing between LTL and CTL for a property, or when the user has a temporal claim and doesn't know which operators express it.
Migrates a Spring MVC application to Spring Boot, converting XML config to auto-configuration, restructuring the project, and replacing container deployment with embedded. Use when modernizing a legacy Spring app, when moving off a standalone servlet container, or when the user has web.xml and wants application.yml.
Identifies bugs through static code analysis (null dereferences, type mismatches, control flow issues) without executing the program. Use when scanning code for defects before running tests, when the user asks for static analysis, or when integrating with CI for defect detection.
Scans source code for security vulnerabilities by applying Project CodeGuard rules — injection, unsafe deserialization, XSS, path traversal, broken access control. Use when performing a security audit, when reviewing a PR that touches request handlers or database queries, when the user asks for a vulnerability scan, or when wiring security checks into CI.
Applies the SZZ algorithm to VCS history to identify which commits introduced bugs by correlating bug-fix commits with earlier changes. Use when mining a repository for bug-introducing commits, when building a defect-prediction dataset, or when the user asks which commit introduced a given fixed bug.
Analyzes a codebase to quantify and locate technical debt — where it lives, what it costs, and what order to pay it down in. Use when planning a refactoring sprint, when justifying engineering time to stakeholders, when the user asks where the codebase hurts most, or when onboarding to a legacy system.
Writes documentation for test cases — names, docstrings, and comments that explain what behavior is being tested and why, so a failing test tells you what broke without reading the assertion. Use when test names are test_1 through test_47, when tests fail and nobody knows what they mean, or when onboarding needs a readable test suite.
Shrinks a failing test input to its minimal form while preserving the failure — delta debugging and structured shrinking to find the smallest input that still triggers the bug. Use when a fuzzer or property test finds a failure with a huge input, when a bug report has an unwieldy reproduction, or when you need a minimal test case for a regression suite.
Generates code test-first — writes a failing test from a requirement, then generates the minimal code to pass it, then refactors, in strict red-green-refactor cycles. Use when building new features where the spec is clear, when the design is uncertain and you want tests to drive it, or when you need high confidence in coverage from the start.
Generates test oracles — the "expected output" part of a test — by choosing among reference implementations, invariants, inverse functions, or differential comparison when the correct answer isn't obvious. Use when the hard part of testing is knowing what the right answer is, not generating inputs.
Builds a bidirectional traceability matrix linking requirements to design elements, code, and tests — so every requirement traces forward to its implementation and every test traces back to its justification. Use for compliance audits, when answering why a piece of code exists, or when checking that nothing was built without a reason.
Generates unit tests for a function or class by analyzing branches, boundaries, and error paths — then emits test code in the project's existing framework and style. Covers happy path, edge cases, and failure modes with mocks for external dependencies. Use when writing tests for new code, when backfilling coverage on untested functions, when the user asks to generate tests, or when a coverage report shows specific gaps.
Extracts human-readable pseudocode from a verified formal artifact (Dafny, Lean, TLA+) while preserving the verified properties as annotations, so the proof-carrying logic can be reimplemented in a production language. Use when porting verified code to an unverified target, when documenting what a formal spec actually does, or when handing a verified algorithm to an implementer.
Optimizes code for performance by identifying the actual bottleneck, choosing the right optimization lever, and measuring the result. Use when a specific operation is too slow, when a profiler has pointed at a hot path, or when the user asks to make something faster.
Reviews and designs API contracts — function signatures, REST endpoints, library interfaces — for usability, evolvability, and the principle of least surprise. Use when designing a new public interface, when reviewing an API PR, when the user asks whether a signature is well-designed, or when planning a breaking change.
Transforms a changelog or commit range into user-friendly release notes with highlights, upgrade guidance, and narrative framing. Use when publishing a release announcement, when the changelog is too dense for users to read, or when the user needs a blog-post-shaped summary of a version.
Uses an existing test suite as the behavioral oracle during a migration, tracking which tests pass at each step and localizing regressions to specific migration changes. Use when porting or refactoring code that has tests, when the user wants to migrate incrementally with a safety net, or when a migration broke something and you need to find which step did it.
Finds code by meaning, structure, or text across large codebases — picks the right search strategy (grep, AST query, call graph walk, semantic search) for the question being asked. Use when the user asks where something is implemented, when navigating unfamiliar code, or when a simple grep isn't enough.
Extracts language-agnostic pseudocode from real code, stripping syntax noise and language-specific machinery while preserving the algorithmic structure. Use when documenting an algorithm for a paper or spec, when porting and wanting a neutral intermediate, or when explaining code to someone who doesn't know the source language.
Translates pseudocode into idiomatic Python, choosing the right standard-library structures and leveraging Python idioms that pseudocode doesn't express. Use when implementing an algorithm from a paper or spec, when the user hands you pseudocode and wants Python, or when realizing a verified-pseudocode artifact.
Executes refactorings — extract method, inline, rename, move — in small, behavior-preserving steps with a test between each. Use when the user wants to restructure working code, when cleaning up after a feature lands, or when a smell has been identified and needs fixing.
Recognizes structural situations that match known design patterns and recommends whether to apply them — or explains why the pattern doesn't fit. Use when the user has a structural problem and is considering a pattern, when reviewing a design that uses a pattern questionably, or when the user asks which pattern fits their situation.
Automatically synthesizes code patches to fix identified bugs, leveraging the bug location and surrounding context. Use when a bug has been localized and the user wants an automated fix, when generating candidate patches for review, or when the user asks to fix a specific bug.
Uses failing test results as signals to guide bug search and narrow down candidate fault locations. Use when one or more tests are failing and the user wants to understand what's broken, when CI reports failures, or when triaging a batch of test failures after a change.
Produces a structured summary of a requirements document — the key obligations, grouped by actor and concern, with the MUST/SHOULD/MAY breakdown. Use when onboarding to a large spec, when deciding what to implement first, or when the user asks what a 200-page standard actually requires.
Matches code against Project CodeGuard's catalog of known-dangerous patterns — banned C functions, weak crypto primitives, hardcoded credentials, deprecated APIs. Use when grepping for low-hanging security fruit, when enforcing a ban-list in CI, or when the user asks to check for known-bad patterns.
Updates broken pytest tests after intentional code changes — triaging assertion failures from mock-coupling failures from genuine regressions, using Python's introspection to automate where safe. Use when a refactor or API change leaves a pile of failing tests and you need to decide update vs. fix vs. delete.
Generates domain-specific mutation operators beyond the standard arithmetic/relational set — mutations tailored to your codebase's idioms, APIs, and bug history that standard tools don't try. Use when generic mutation testing plateaus, when your domain has specific failure modes, or when mining bug history reveals patterns standard operators miss.
Translates C/C++ into Lean 4 for interactive theorem proving — deep verification where automated tools fail. Use when Dafny's automation isn't enough, when proving mathematical properties of an algorithm, or when building a machine-checked proof for publication or certification.
TLA+-specific instance of model-guided repair — reads a TLC error trace, identifies the enabling condition that should have been false, strengthens the corresponding action, and maps the fix to source code. Use when TLC reports an invariant violation or deadlock and you have the code-to-TLA+ mapping from extraction.
Translates natural-language or pseudocode descriptions of concurrent and distributed systems into TLA+ specifications ready for the TLC model checker. Identifies state variables, actions, type invariants, safety properties, and liveness properties from the description. Use when formalizing a protocol, when the user describes a distributed algorithm to verify, when designing a consensus or locking scheme, or when starting formal verification of a concurrent system.
Extracts an SMV (NuSMV/nuXmv) finite-state model from code or state-machine descriptions, for CTL/LTL model checking of reactive systems. Use when verifying hardware-adjacent or embedded logic, when the state space is naturally finite and small, or when CTL branching-time properties are needed.
Pinpoints the exact file, function, or line in a codebase responsible for a reported bug using static and dynamic analysis signals. Use when a bug is reported but the fault location is unknown, when narrowing down a failure to a specific code region, when triaging an issue tracker ticket, or when the user asks to locate where a bug originates.
Detects logical and semantic bugs by understanding program intent — catches issues that syntax-only tools miss. Use when static analysis has already run and found nothing, when the user reports incorrect behavior but no crash, or when reviewing algorithmic code for correctness.
Generates code comments that explain non-obvious intent, constraints, and tradeoffs — not what the code already says. Use when code is correct but opaque, when documenting for future maintainers, or when a function's why is harder to see than its what.
Generates metamorphic tests — tests that check relationships between multiple runs instead of checking exact outputs, useful when the correct output is unknown or expensive to compute. Use when there's no oracle, when testing ML/numerical/search code, or when the spec describes properties rather than values.
Summarizes undocumented legacy code by inferring intent from structure, naming, data flow, and calling context — explicitly flagging what's inferred vs. what's certain. Use when onboarding to inherited code, when documentation is missing or wrong, or when deciding whether legacy code is safe to change.
Compares the runtime behavior of two or more versions of the same code by running them on identical inputs and diffing outputs, side effects, and errors. Use when validating a refactor, port, or optimization; when the user asks if two implementations behave the same; or when investigating a suspected regression across versions.
Performs structured code review on a diff or file set, producing inline comments with severity levels and a summary. Checks correctness, error handling, security, and maintainability — in that priority order. Use when reviewing a pull request, when the user asks for a code review, when preparing code for merge, or when a second opinion is needed on a change.
Proves two program fragments semantically equivalent using symbolic reasoning — stronger than testing, applicable when differential testing is insufficient or impossible. Use when behavior preservation must be proven rather than sampled, when the input space is too large to enumerate, or when a transformation needs a correctness argument.
Detects inconsistencies across configuration files, environments, and deployment manifests — missing keys, drifted values, type mismatches. Use when debugging why staging behaves differently from production, before a deploy to catch config drift, or when auditing multi-environment configs.
Translates natural-language requirements into formal constraints — logical predicates, schemas, or property-based test generators — that a machine can check. Use when turning a spec into validation code, when writing property tests, or as the bridge between requirements and formal verification.
Derives executable test cases directly from requirements (user stories, acceptance criteria, specs) by extracting testable conditions, enumerating equivalence classes and boundaries, and producing a traceability map from each test back to its source requirement. Use when building acceptance tests from a spec, when checking whether requirements are covered by existing tests, when translating Gherkin or plain-English criteria into code, or when proving coverage for compliance.
Compares two versions of a requirements document and reports additions, removals, semantic changes, and scope drift — distinguishing clerical edits from meaning changes. Use when a spec was revised, when checking if a new version of a standard affects you, or when the user asks what changed between spec versions.
Generates concrete scenarios from a requirement — happy paths, edge cases, and error conditions — expressed as Given/When/Then or equivalent structured narratives. Use when turning a requirement into acceptance tests, when exploring what could go wrong, or when the requirement is abstract and needs grounding.
Sets up taint tracking by defining sources, sinks, and sanitizers from Project CodeGuard's input-validation taxonomy, then configures the target tool (CodeQL, Semgrep, custom instrumentation). Use when wiring taint analysis into CI, when the user asks for taint tracking, or when you need a source/sink catalog for a specific language.
Updates broken JUnit tests after a deliberate code change — distinguishing tests that broke because the behavior changed (update assertion) from tests that broke because they were overcoupled to structure (loosen or delete). Use after API changes, refactors, or intentional behavior changes leave a trail of failing tests.
Generates tests that mock external dependencies — HTTP, databases, filesystems, clocks — isolating the unit under test while still exercising realistic interactions. Use when the code has side effects you can't run in a test, when external services are slow or unavailable, or when testing error paths that are hard to trigger for real.