santosomar/behavior-preservation-checker

Name: behavior-preservation-checker
Author: santosomar

skills/code-quality/behavior-preservation-checker/SKILL.md

npx skillsauth add santosomar/general-secure-coding-agent-skills behavior-preservation-checker

Behavior Preservation Checker

"It's just a refactor" is a claim. This skill checks the claim: does the new code produce the same observable behavior as the old code on the inputs that matter?

Approaches — cheapest to strongest

| Approach | Checks | Cost | Confidence |
| --- | --- | --- | --- |
| Run the existing tests | Whatever the tests assert | Free | As good as your test suite, which is often not very good |
| Differential testing | Old and new produce same output on random/prod inputs | Low | High where you can enumerate inputs |
| Golden-master / snapshot | Output matches a recorded baseline byte-for-byte | Low | Very high for serialized output; brittle |
| Side-effect capture | Same DB writes, same HTTP calls, same log lines | Medium | Catches effects tests usually miss |
| Property-based equivalence | ∀x. old(x) == new(x) over generated inputs | Medium | High for pure functions |
| Formal equivalence proof | Proven equal by construction | High | Absolute; see semantic-equivalence-verifier |

Use the cheapest one that gives you the confidence you need. Differential testing covers 90% of cases.
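As an illustration of the golden-master row, here is a minimal sketch using only the standard library. The names `render_invoice` and `check_against_golden`, the order fields, and the baseline file name are all hypothetical, chosen for the example rather than taken from the skill.

```python
import json
import tempfile
from pathlib import Path


def render_invoice(order: dict) -> bytes:
    """Hypothetical function under refactor: serializes an order deterministically."""
    return json.dumps(order, sort_keys=True, separators=(",", ":")).encode("utf-8")


def check_against_golden(output: bytes, golden_path: Path) -> bool:
    """Record the baseline on the first run; afterwards compare byte-for-byte."""
    if not golden_path.exists():
        golden_path.write_bytes(output)   # first run records the baseline
        return True
    return golden_path.read_bytes() == output


# Demo in a throwaway directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    golden = Path(tmp) / "invoice_1.golden"
    order = {"id": 1, "total": "12.50"}
    first_run = check_against_golden(render_invoice(order), golden)    # records
    second_run = check_against_golden(render_invoice(order), golden)   # compares, matches
    changed = check_against_golden(render_invoice({**order, "total": "12.49"}), golden)
```

In practice the baseline would be recorded by running the old version and checked by running the new one. Byte-for-byte comparison is what makes the approach brittle: any change in key order or whitespace counts as a divergence.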

Differential testing — the workhorse

for each input in <sample>:
    old_out = old_version(input)
    new_out = new_version(input)
    if old_out != new_out:
        REPORT divergence

Where <sample> comes from:

Production replay: Captured real inputs. Best signal.
Existing test inputs: What your tests already feed in. Cheap but narrow.
Fuzz / property generation: Random inputs in the valid domain. Broad but can miss real-world shapes.
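The loop above might be made concrete along these lines. This is a sketch under assumptions, not the skill's own harness: `Outcome`, `observe`, `diff_test`, and the `old_lookup`/`new_lookup` pair are hypothetical names. The one deliberate design choice is that an exception is recorded as an outcome, so a crash in one version cannot pass silently as a match.

```python
from dataclasses import dataclass
from typing import Any, Callable, Iterable


@dataclass(frozen=True)
class Outcome:
    """What a call observably did: returned a value or raised an exception type."""
    kind: str      # "return" or "raise"
    value: Any     # the return value, or the exception class name


def observe(fn: Callable[[Any], Any], arg: Any) -> Outcome:
    """Run fn(arg) and record an exception as an outcome instead of swallowing it."""
    try:
        return Outcome("return", fn(arg))
    except Exception as exc:
        return Outcome("raise", type(exc).__name__)


def diff_test(old: Callable, new: Callable, sample: Iterable[Any]) -> list:
    """Return (input, old_outcome, new_outcome) for every divergent input."""
    divergences = []
    for item in sample:
        old_out, new_out = observe(old, item), observe(new, item)
        if old_out != new_out:
            divergences.append((item, old_out, new_out))
    return divergences


# Hypothetical old/new pair: the "refactor" quietly changed the missing-key case.
def old_lookup(d: dict) -> int:
    return d["rate"] * 2


def new_lookup(d: dict) -> int:
    return d.get("rate", 0) * 2


report = diff_test(old_lookup, new_lookup, [{"rate": 3}, {}, {"rate": 0}])
```

Here `report` holds a single entry, for the empty dict: the old version raises `KeyError` and the new one returns `0`.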

What counts as "same behavior"

Decide before comparing — not after you see a diff:

| Observable | Must match? |
| --- | --- |
| Return value | Yes, by definition |
| Exception type + message | Type yes; message usually yes, but debatable |
| Side effects (DB, files, network) | Yes; this is where refactors silently break |
| Side-effect order | Depends: was order specified, or incidental? |
| Log output | Usually no; logs are diagnostics, not contract |
| Timing / performance | Usually no, unless that's the contract |
| Iteration order | Depends: was it dict (unordered pre-3.7) or list (ordered)? |
| Float precision | Equal within ε, not bit-exact; define ε upfront |

Write down the equivalence relation. "Same return value, same DB writes (order-insensitive), ignore logs, floats within 1e-9."
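One way to make that written relation hard to bend later is to encode it as a function before running any comparison. The sketch below assumes a hypothetical `Observation` record holding a float return value, a list of hashable DB writes, and log lines; none of these names come from the skill.

```python
import math
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Observation:
    """Hypothetical record of one call's observable behavior."""
    return_value: float
    db_writes: list = field(default_factory=list)   # e.g. [("orders", 1), ...]
    logs: list = field(default_factory=list)        # recorded but never compared


EPSILON = 1e-9   # float tolerance, fixed before any diff is inspected


def equivalent(old: Observation, new: Observation) -> bool:
    """Same return value within EPSILON, same DB writes in any order, logs ignored."""
    same_value = math.isclose(
        old.return_value, new.return_value, rel_tol=0.0, abs_tol=EPSILON
    )
    same_writes = Counter(old.db_writes) == Counter(new.db_writes)   # multiset comparison
    return same_value and same_writes


a = Observation(1.0, [("orders", 1), ("audit", 1)], ["start"])
b = Observation(1.0 + 1e-12, [("audit", 1), ("orders", 1)], ["begin", "end"])
c = Observation(1.0, [("orders", 1), ("audit", 1), ("audit", 1)])
```

`equivalent(a, b)` holds despite the reordered writes, the different logs, and the tiny float difference. `equivalent(a, c)` does not, because `c` carries an extra write.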

Worked example

Change: Refactored compute_tax(order) — was a 60-line method, now calls three helpers.

Setup: Both versions available — old commit checked out in a sibling worktree.

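# differential_test.py
from old.tax import compute_tax as old_compute
from new.tax import compute_tax as new_compute

def test_equivalence(sample_orders):         # 500 orders from prod snapshot
    divergent = []                           # collect every divergence instead of stopping at the first
    for order in sample_orders:
        old = old_compute(order)
        new = new_compute(order)
        if abs(old - new) >= 0.001:          # tolerance from the equivalence relation
            divergent.append((order.id, old, new))
    assert not divergent, f"{len(divergent)} diverged: {divergent}"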

Run: 498 match. 2 diverge:

Order #44291: old=12.50, new=12.49. Off by a cent.
Order #81007: old=0.00, new=0.00. Wait — match? Rerun: old raised KeyError (test swallowed it). new returned 0.00.

Findings:

The penny diff — new code rounds at a different step. Accumulation order changed. Real behavior change. Either a latent bug in old code (fixed accidentally) or a regression — need domain judgment.
Order #81007 — old code crashed on orders with no tax jurisdiction set; new code returns 0. Real behavior change. Probably an improvement, but it's not "just a refactor."
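The example does not say which step the rounding moved to, so the following only sketches one way a penny difference like the one on order #44291 can arise: rounding each line and then summing, versus summing and then rounding once. The function names and per-line amounts are made up.

```python
from decimal import Decimal, ROUND_HALF_UP

CENT = Decimal("0.01")


def round_cents(amount: Decimal) -> Decimal:
    """Round to whole cents, halves rounding up."""
    return amount.quantize(CENT, rounding=ROUND_HALF_UP)


def tax_round_each_line(line_taxes: list) -> Decimal:
    """Round every line to cents, then add."""
    return sum((round_cents(t) for t in line_taxes), Decimal("0"))


def tax_round_total(line_taxes: list) -> Decimal:
    """Add the unrounded lines, then round once."""
    return round_cents(sum(line_taxes, Decimal("0")))


lines = [Decimal("0.125"), Decimal("0.125")]   # made-up per-line tax amounts
per_line = tax_round_each_line(lines)          # 0.13 + 0.13
at_end = tax_round_total(lines)                # round(0.250)
```

With these inputs `per_line` is `0.26` and `at_end` is `0.25`. Both functions look like reasonable refactorings of each other, which is why the decision about which one is correct needs domain judgment rather than a code review.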

Side-effect capture

For non-pure functions, return value isn't enough. Capture effects:

with capture_sql() as old_queries:
    old_fn(x)
with capture_sql() as new_queries:
    new_fn(x)
assert normalize(old_queries) == normalize(new_queries)

Here normalize applies your equivalence relation to the captured queries: sort them if order doesn't matter, strip timestamps, and so on.
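The snippet above leaves `capture_sql` and `normalize` undefined. For SQLite they could be sketched with the standard library's `Connection.set_trace_callback`. This version takes the connection as an argument, which differs from the zero-argument call shown above, and `fresh_db`, `old_fn`, and `new_fn` are hypothetical stand-ins.

```python
import sqlite3
from contextlib import contextmanager


@contextmanager
def capture_sql(conn: sqlite3.Connection):
    """Collect every SQL statement SQLite executes on conn inside the with-block."""
    statements = []
    conn.set_trace_callback(statements.append)
    try:
        yield statements
    finally:
        conn.set_trace_callback(None)   # stop tracing


def normalize(statements: list) -> list:
    """Order-insensitive comparison: sort the statements."""
    return sorted(statements)


def fresh_db() -> sqlite3.Connection:
    # isolation_level=None keeps the module from issuing implicit BEGIN statements.
    conn = sqlite3.connect(":memory:", isolation_level=None)
    conn.execute("CREATE TABLE audit (event TEXT)")
    return conn


# Hypothetical old/new functions that write the same rows in a different order.
def old_fn(conn):
    conn.execute("INSERT INTO audit VALUES ('created')")
    conn.execute("INSERT INTO audit VALUES ('taxed')")


def new_fn(conn):
    conn.execute("INSERT INTO audit VALUES ('taxed')")
    conn.execute("INSERT INTO audit VALUES ('created')")


old_conn, new_conn = fresh_db(), fresh_db()
with capture_sql(old_conn) as old_queries:
    old_fn(old_conn)
with capture_sql(new_conn) as new_queries:
    new_fn(new_conn)
```

The raw lists differ because the inserts happen in a different order; after `normalize` they compare equal. Whether that counts as equivalent depends on whether write order is part of the relation you wrote down.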

Edge cases

Old version had a bug: Differential testing will flag the fix as a divergence. That's correct — it IS a behavior change. Report it; let the human decide it's a desired change.
Nondeterminism (threads, random, time): Both versions produce different outputs run-to-run. Seed the RNG; freeze the clock; serialize the threads. If you can't, you can only compare distributions, not values.
Inputs with side effects (reading an iterator exhausts it): Can't feed the same input to both. Clone/tee the input, or record-replay.
The refactor changed the signature: Write an adapter so both versions take the same shape. The adapter is part of the refactor.
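Two of these edge cases, single-use inputs and randomness, can be handled together in a small wrapper. A sketch, assuming both versions draw from the global `random` module and take an iterator; `run_both`, `old_total`, and `new_total` are hypothetical.

```python
import itertools
import random
from typing import Callable, Iterator


def run_both(old: Callable, new: Callable, stream: Iterator, seed: int = 0):
    """Feed one single-use iterator to both versions under the same RNG seed."""
    old_stream, new_stream = itertools.tee(stream)   # two independent views of one iterator
    random.seed(seed)
    old_out = old(old_stream)
    random.seed(seed)                                # reset so both see the same random draws
    new_out = new(new_stream)
    return old_out, new_out


# Hypothetical pair: both add random jitter to the sum of the stream.
def old_total(items):
    total = 0
    for x in items:
        total += x
    return total + random.randint(0, 9)


def new_total(items):
    return sum(items) + random.randint(0, 9)


old_out, new_out = run_both(old_total, new_total, iter([1, 2, 3]))
```

Seeding the global module does nothing for code that uses its own `random.Random` instance or an OS entropy source, and `itertools.tee` buffers items in memory, so very large streams may call for record-replay instead.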

Do not

Do not accept "tests pass" as sufficient evidence for a large refactor. Tests cover the paths someone thought of. Production inputs cover the paths that actually happen.
Do not decide what "equivalent" means after you see the diffs. You'll rationalize every divergence. Write the relation first.
Do not ignore side-effect divergence because return values match. An extra DB write is a behavior change.
Do not treat a divergence as automatically a bug. Sometimes the old behavior was the bug. But it IS a change, and the PR shouldn't claim "no behavior change."

Output format

## Equivalence relation
<return values | side effects | what's compared, what's ignored>

## Sample
<N> inputs from <source>

## Result
<N-k> equivalent
<k> divergent:
  input=<summary>  old=<val>  new=<val>
    <verdict: regression | latent-bug-fix | incidental | needs-review>

## Confidence
<high | medium | low — based on sample coverage>
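If the report is produced by a script, the template could be filled in roughly as follows. `Divergence` and `render_report` are hypothetical names. The sample values reuse the worked example's numbers; the verdicts and the confidence level are placeholders for illustration, not results stated in the example.

```python
from dataclasses import dataclass


@dataclass
class Divergence:
    input_summary: str
    old: str
    new: str
    verdict: str   # regression | latent-bug-fix | incidental | needs-review


def render_report(relation: str, n: int, source: str,
                  divergences: list, confidence: str) -> str:
    """Fill in the report template; n is the total sample size."""
    k = len(divergences)
    lines = [
        "## Equivalence relation", relation, "",
        "## Sample", f"{n} inputs from {source}", "",
        "## Result", f"{n - k} equivalent", f"{k} divergent:",
    ]
    for d in divergences:
        lines.append(f"  input={d.input_summary}  old={d.old}  new={d.new}")
        lines.append(f"    {d.verdict}")
    lines += ["", "## Confidence", confidence]
    return "\n".join(lines)


report = render_report(
    relation="return value within 0.001; exceptions compared by type",
    n=500,
    source="prod snapshot",
    divergences=[
        Divergence("order #44291", "12.50", "12.49", "needs-review"),
        Divergence("order #81007", "KeyError", "0.00", "needs-review"),
    ],
    confidence="medium",   # placeholder value for illustration
)
```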

Verifies that a refactoring or transformation preserved observable behavior by comparing before and after execution, differential testing, or I/O capture. Use after a refactoring, after automated code transformation, before merging a structural PR, or whenever the claim is that two code versions do the same thing.

development

Updated Apr 13, 2026

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

npx skillsauth add santosomar/general-secure-coding-agent-skills behavior-preservation-checker

Security Scan Results

3 of 9 scanners reported clean. Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

Trivy (container and dependency vulnerability scanner): Clean, 95%
Semgrep (static code analysis for vulnerabilities): Clean, 95%
mcp-scan (Snyk) (Model Context Protocol security validation): Clean, 95%
Snyk (dep) (open source security scanning): Skipped, 50%
Socket.dev (supply chain security analysis): Skipped, 50%
VirusTotal (multi-engine malware detection): Skipped, 50%
CrowdStrike (advanced threat intelligence): Skipped, 50%
OSV-Scanner (Open Source Vulnerability database check): Skipped, 50%
OWASP Dep-Check: Skipped, 50%

Last scanned: Apr 13, 2026, 4:02 AM (99.5s, 1 file scanned)

SKILL.md

name: behavior-preservation-checker
description: Verifies that a refactoring or transformation preserved observable behavior by comparing before and after execution, differential testing, or I/O capture. Use after a refactoring, after automated code transformation, before merging a structural PR, or whenever the claim is that two code versions do the same thing.
license: Apache-2.0
category: code-quality
suite: general-secure-coding-agent-skills
version: 0.3.0
related: semantic-equivalence-verifier, code-refactoring-assistant, multi-version-behavior-comparator

Related Skills

santosomar/verified-pseudocode-extractor (development, updated Apr 13, 2026)
Extracts human-readable pseudocode from a verified formal artifact (Dafny, Lean, TLA+) while preserving the verified properties as annotations, so the proof-carrying logic can be reimplemented in a production language. Use when porting verified code to an unverified target, when documenting what a formal spec actually does, or when handing a verified algorithm to an implementer.

santosomar/tlaplus-spec-generator (development, updated Apr 13, 2026)
Translates natural-language or pseudocode descriptions of concurrent and distributed systems into TLA+ specifications ready for the TLC model checker. Identifies state variables, actions, type invariants, safety properties, and liveness properties from the description. Use when formalizing a protocol, when the user describes a distributed algorithm to verify, when designing a consensus or locking scheme, or when starting formal verification of a concurrent system.

santosomar/tlaplus-model-reduction (testing, updated Apr 13, 2026)
Reduces a TLA+ model so TLC can actually check it (shrinks constants, adds state constraints, abstracts data, or applies symmetry) when the state space is too large to enumerate. Use when TLC runs out of memory, when checking takes hours, or when a spec works at N=2 and you need confidence at larger scale.

santosomar/tlaplus-guided-code-repair (development, updated Apr 13, 2026)
TLA+-specific instance of model-guided repair: reads a TLC error trace, identifies the enabling condition that should have been false, strengthens the corresponding action, and maps the fix to source code. Use when TLC reports an invariant violation or deadlock and you have the code-to-TLA+ mapping from extraction.

Install manually

# Clone the repo
git clone https://github.com/santosomar/general-secure-coding-agent-skills.git

# Make sure the global Claude Code skills folder exists, then copy the skill into it
mkdir -p ~/.claude/skills
cp -r general-secure-coding-agent-skills/skills/code-quality/behavior-preservation-checker ~/.claude/skills/

Repository

santosomar/general-secure-coding-agent-skills

Compatible with

Claude Code

OpenAI Codex CLI

ChatGPT