skills/testing/python-test-updater/SKILL.md
Updates broken pytest tests after intentional code changes: separates plain assertion failures, mock-coupling failures, and genuine regressions, and uses Python's introspection to automate fixes where safe. Use when a refactor or API change leaves a pile of failing tests and you need to decide update vs. fix vs. delete.
Same triage discipline as → java-test-updater. Python differences: no compile-time breaks (everything fails at runtime), more mocker.patch coupling, and snapshot libraries make some updates one command.
No compile step means everything surfaces at test runtime:
| Failure | Python-specific signal | Action |
| --------------------------------------------- | ----------------------------------------------- | ------------------------ |
| AttributeError: 'X' has no attribute 'foo' | Renamed/removed method | Update call site |
| TypeError: f() missing 1 positional argument | Signature changed | Add arg or use default |
| TypeError: f() got an unexpected keyword | Kwarg renamed/removed | Update kwarg |
| ImportError / ModuleNotFoundError | Module moved | Update import |
| AssertionError with value diff | Behavior changed (intentional?) or regression | Triage |
| AssertionError: Expected 'mock' to be called | Over-mocked internal | Loosen/delete mock |
| AssertionError in snapshot compare | Snapshot stale | Review diff, --snapshot-update if intentional |
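The table above can be approximated by a first-pass classifier over the failure text. This is a minimal sketch, not part of the skill itself: the bucket names and the `classify_failure` helper are hypothetical, and matching the word "snapshot" is an assumption, since the exact wording depends on the snapshot plugin in use.

```python
import re

# Ordered rules: the first matching pattern wins, so the generic
# AssertionError rule must come last.
_RULES = [
    (re.compile(r"ModuleNotFoundError|ImportError"), "import"),
    (re.compile(r"AttributeError: .* has no attribute"), "attribute"),
    (re.compile(r"TypeError: .*(missing \d+ (required )?positional argument"
                r"|unexpected keyword argument)"), "signature"),
    (re.compile(r"AssertionError: Expected '.*' to (be|have been) called"), "mock"),
    # Assumption: snapshot plugins mention "snapshot" somewhere in the message.
    (re.compile(r"snapshot", re.IGNORECASE), "snapshot"),
    (re.compile(r"AssertionError"), "assertion"),
]


def classify_failure(message: str) -> str:
    """Map one failure message to a triage bucket (hypothetical names)."""
    for pattern, bucket in _RULES:
        if pattern.search(message):
            return bucket
    return "unknown"
```

Only the `import`, `attribute`, and `signature` buckets are candidates for mechanical fixes. Anything landing in `assertion`, `mock`, or `snapshot` still needs a human decision.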
Python's runtime errors point straight at the problem. For signature changes across many tests:
```python
# conftest.py: one-time shim during migration
# OLD: Order(items, region)  NEW: Order(items, region, currency="USD")
# Tests still call Order(items, region). Temporary compat:
import pytest

from orders.models import Order  # assumed import path; adjust to where Order lives


@pytest.fixture(autouse=True)
def _order_compat(mocker):  # `mocker` is provided by the pytest-mock plugin
    orig_init = Order.__init__

    def compat_init(self, items, region, currency="USD"):
        orig_init(self, items, region, currency)

    mocker.patch.object(Order, "__init__", compat_init)
```
This is a bridge, not a fix. All tests pass → remove the shim → fix tests one by one as you touch them. Don't leave compat shims in permanently.
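Introspection can tell you whether a shim or a test update is needed at all: a parameter added with a default does not break existing call sites, while a newly required one does. The sketch below uses `inspect.signature` to list newly required parameters. The `newly_required_params` helper and the `order_v1`/`order_v2`/`order_v3` stand-ins are hypothetical, and it assumes both the old and new callables are importable side by side (for example from two checkouts).

```python
import inspect


def newly_required_params(old_func, new_func):
    """Names that new_func requires but old_func did not accept."""
    old = inspect.signature(old_func).parameters
    new = inspect.signature(new_func).parameters
    variadic = (inspect.Parameter.VAR_POSITIONAL, inspect.Parameter.VAR_KEYWORD)
    return [
        name
        for name, param in new.items()
        if name not in old
        and param.default is inspect.Parameter.empty  # no default, so required
        and param.kind not in variadic  # *args / **kwargs are never required
    ]


# Hypothetical stand-ins for the constructor before and after the change.
def order_v1(items, region): ...
def order_v2(items, region, currency): ...
def order_v3(items, region, currency="USD"): ...


breaking = newly_required_params(order_v1, order_v2)    # ["currency"]
compatible = newly_required_params(order_v1, order_v3)  # []
```

An empty result means old call sites keep working, so any remaining failures come from somewhere other than the signature.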
## mocker.patch problem

```python
def test_process(mocker):
    mock_validate = mocker.patch("orders.service._validate_order")
    mock_save = mocker.patch("orders.service._save_order")
    process(order)
    mock_validate.assert_called_once_with(order)
    mock_save.assert_called_once()
```
_validate_order was inlined into process. Test fails: Expected '_validate_order' to have been called once. Called 0 times.
The behavior didn't change — the order is still validated — but the test was asserting structure. Delete the mock assertion. The test should assert the outcome of validation:
```python
def test_process_rejects_invalid():
    bad = Order(items=[])
    with pytest.raises(InvalidOrder, match="empty"):
        process(bad)
```
This survives refactors because it tests what validation does, not that a function named _validate_order was called.
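The difference can be shown end to end with the standard library's `unittest.mock`, which `mocker` wraps. This is a sketch under stated assumptions: `ServiceBefore` and `ServiceAfter` are hypothetical stand-ins for the code before and after validation is inlined, and `create=True` is used only so the patch can be applied once the private method no longer exists.

```python
from unittest import mock


class InvalidOrder(Exception):
    pass


class ServiceBefore:
    """Validation lives in a private helper."""

    def _validate_order(self, items):
        if not items:
            raise InvalidOrder("empty order")

    def process(self, items):
        self._validate_order(items)
        return len(items)


class ServiceAfter:
    """Same behavior, helper inlined into process()."""

    def process(self, items):
        if not items:
            raise InvalidOrder("empty order")
        return len(items)


def structural_check(service):
    """Mock-coupled: passes only if a method named _validate_order is called."""
    with mock.patch.object(service, "_validate_order", create=True) as validate:
        service.process(["widget"])
        try:
            validate.assert_called_once_with(["widget"])
        except AssertionError:
            return False
    return True


def behavioral_check(service):
    """Outcome-based: passes if an empty order is rejected."""
    try:
        service.process([])
    except InvalidOrder:
        return True
    return False
```

Run against both versions, the structural check flips from passing to failing even though nothing observable changed, while the behavioral check passes on both.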
If using syrupy / pytest-snapshot:
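```bash
pytest --snapshot-update
```

This updates all failing snapshots to current output. Dangerous if any failure is a regression. Workflow:

1. Run the tests without `--snapshot-update`. Read every diff.
2. Only once each diff is confirmed intentional, run `--snapshot-update`.

For a failing value assertion, check the history before changing the expected value:

```python
assert invoice.total == Decimal("27.80")  # fails: actual 27.55
```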
git log -p -- src/pricing.py → "Fix: half-even rounding." Intentional. Update with reason:
```python
# abc123: half-even rounding fix. Was 27.80 (half-up bug).
assert invoice.total == Decimal("27.55")
```

Versus:

```python
assert len(results) == 5  # fails: actual 4
```
The change was to a completely different module. Why are there fewer results? Regression. Don't update. Investigate.
## pytest.approx drift

```python
assert score == 0.8472819  # now fails: 0.8472820
```

The last digit changed because floating-point operations were reordered. If precision to 7 decimals isn't spec'd, this was over-tight:

```python
assert score == pytest.approx(0.8473, rel=1e-4)
```
This isn't "loosening to make it pass." It's fixing an over-specific assertion that should never have been that tight.
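A small standard-library sketch shows both halves of that argument: reordering float additions changes the last bits, and a tolerance that is too wide accepts a wrong answer. `within_rel` is a hypothetical helper that mirrors the relative part of the check `pytest.approx(expected, rel=...)` performs, ignoring its small default absolute tolerance.

```python
import math

# Same three numbers, different grouping: the results differ in the last bits.
left_to_right = (0.1 + 0.2) + 0.3
right_to_left = 0.1 + (0.2 + 0.3)

exact_match = left_to_right == right_to_left
close_enough = math.isclose(left_to_right, right_to_left, rel_tol=1e-9)


def within_rel(actual, expected, rel):
    """Relative-tolerance check, scaled by the expected value."""
    return abs(actual - expected) <= rel * abs(expected)


drift_ok = within_rel(0.8472820, 0.8473, rel=1e-4)  # last-digit drift passes
wrong_caught = not within_rel(0.6, 0.8473, rel=1e-4)  # a wrong score fails
wrong_hidden = within_rel(0.6, 0.8473, rel=0.5)  # too wide: wrong score passes
```

The tolerance should come from what the spec actually promises, not from whatever value turns the test green.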
- Don't run `--snapshot-update` blindly. Review diffs first. A regression in a snapshot looks identical to an intentional change.
- Don't fix `mocker.patch("module._private")` failures by updating the patch path. You're chasing implementation. Replace with behavioral assertions.
- Don't widen `pytest.approx` tolerance until the test passes. If `rel=0.5` is what it takes, the test isn't testing anything.
- Don't leave compat shims in `conftest.py` after the migration. They mask further API drift.

## Failing tests
Total: <N> Import/Attribute: <N> Signature: <N> Assertion: <N> Mock: <N> Snapshot: <N>
## Mechanical fixes
| Test | Error | Fix |
| ---- | ----- | --- |
## Mock decoupling
| Test | Over-coupled patch | Replacement assertion |
| ---- | ------------------ | --------------------- |
## Assertion triage
| Test | Old | New | Cause commit | Classification | Action |
| ---- | --- | --- | ------------ | -------------- | ------ |
## Snapshot review
| Snapshot | Diff summary | Intentional? |
| -------- | ------------ | ------------ |
## Regressions
<tests correctly failing — file bugs, don't update>
## After
Passing: <N> Updated: <N> Decoupled: <N> Deleted: <N> Bugs filed: <N>