skills/code-quality/behavior-preservation-checker/SKILL.md
Verifies that a refactoring or transformation preserved observable behavior by comparing before and after execution, differential testing, or I/O capture. Use after a refactoring, after automated code transformation, before merging a structural PR, or whenever the claim is that two code versions do the same thing.
npx skillsauth add santosomar/general-secure-coding-agent-skills behavior-preservation-checkerInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
"It's just a refactor" is a claim. This skill checks the claim: does the new code produce the same observable behavior as the old code on the inputs that matter?
| Approach | Checks | Cost | Confidence |
| --------------------------- | ---------------------------------------------- | ------- | -------------------------------------- |
| Run the existing tests | Whatever the tests assert | Free | As good as your test suite — often not very |
| Differential testing | Old and new produce same output on random/prod inputs | Low | High where you can enumerate inputs |
| Golden-master / snapshot | Output matches a recorded baseline byte-for-byte | Low | Very high for serialized output; brittle |
| Side-effect capture | Same DB writes, same HTTP calls, same log lines | Medium | Catches effects tests usually miss |
| Property-based equivalence | ∀x. old(x) == new(x) over generated inputs | Medium | High for pure functions |
| Formal equivalence proof | Proven equal by construction | High | Absolute — → semantic-equivalence-verifier |
Use the cheapest one that gives you the confidence you need. Differential testing covers 90% of cases.
for each input in <sample>:
old_out = old_version(input)
new_out = new_version(input)
if old_out != new_out:
REPORT divergence
Where <sample> comes from:
Decide before comparing — not after you see a diff:
| Observable | Must match? |
| --------------------------- | ------------------------------------------------------------------ |
| Return value | Yes — by definition |
| Exception type + message | Type yes; message… usually yes but debatable |
| Side effects (DB, files, network) | Yes — this is where refactors silently break |
| Side-effect order | Depends — was order specified, or incidental? |
| Log output | Usually no — logs are diagnostics, not contract |
| Timing / performance | Usually no — unless that's the contract |
| Iteration order | Depends — was it dict (unordered pre-3.7) or list (ordered)? |
| Float precision | Equal within ε, not bit-exact — define ε upfront |
Write down the equivalence relation. "Same return value, same DB writes (order-insensitive), ignore logs, floats within 1e-9."
Change: Refactored compute_tax(order) — was a 60-line method, now calls three helpers.
Setup: Both versions available — old commit checked out in a sibling worktree.
# differential_test.py
from old.tax import compute_tax as old_compute
from new.tax import compute_tax as new_compute
def test_equivalence(sample_orders): # 500 orders from prod snapshot
for order in sample_orders:
old = old_compute(order)
new = new_compute(order)
assert abs(old - new) < 0.001, f"diverged on {order.id}: {old} != {new}"
Run: 498 match. 2 diverge:
12.50, new=12.49. Off by a cent.0.00, new=0.00. Wait — match? Rerun: old raised KeyError (test swallowed it). new returned 0.00.Findings:
For non-pure functions, return value isn't enough. Capture effects:
with capture_sql() as old_queries:
old_fn(x)
with capture_sql() as new_queries:
new_fn(x)
assert normalize(old_queries) == normalize(new_queries)
normalize = sort if order doesn't matter, strip timestamps, etc. — per your equivalence relation.
## Equivalence relation
<return values | side effects | what's compared, what's ignored>
## Sample
<N> inputs from <source>
## Result
<N-k> equivalent
<k> divergent:
input=<summary> old=<val> new=<val>
<verdict: regression | latent-bug-fix | incidental | needs-review>
## Confidence
<high | medium | low — based on sample coverage>
development
Extracts human-readable pseudocode from a verified formal artifact (Dafny, Lean, TLA+) while preserving the verified properties as annotations, so the proof-carrying logic can be reimplemented in a production language. Use when porting verified code to an unverified target, when documenting what a formal spec actually does, or when handing a verified algorithm to an implementer.
development
Translates natural-language or pseudocode descriptions of concurrent and distributed systems into TLA+ specifications ready for the TLC model checker. Identifies state variables, actions, type invariants, safety properties, and liveness properties from the description. Use when formalizing a protocol, when the user describes a distributed algorithm to verify, when designing a consensus or locking scheme, or when starting formal verification of a concurrent system.
testing
Reduces a TLA+ model so TLC can actually check it — shrinks constants, adds state constraints, abstracts data, or applies symmetry — when the state space is too large to enumerate. Use when TLC runs out of memory, when checking takes hours, or when a spec works at N=2 and you need confidence at larger scale.
development
TLA+-specific instance of model-guided repair — reads a TLC error trace, identifies the enabling condition that should have been false, strengthens the corresponding action, and maps the fix to source code. Use when TLC reports an invariant violation or deadlock and you have the code-to-TLA+ mapping from extraction.