skills/code-quality/code-optimizer/SKILL.md
Optimizes code for performance by identifying the actual bottleneck, choosing the right optimization lever, and measuring the result. Use when a specific operation is too slow, when a profiler has pointed at a hot path, or when the user asks to make something faster.
Making code fast is a measurement discipline, not a coding style. The first rule: you don't know where the time goes until you measure. The second rule: you're usually wrong about where you think it goes.
| Question                                       | If no → stop                                     |
| ---------------------------------------------- | ------------------------------------------------ |
| Is there a concrete, measured slowness?        | "It feels slow" is not a measurement             |
| Is the slow path on a hot path?                | A 10s function called once at startup is fine    |
| Is there a target? ("under 100ms p99")         | Without a target, you don't know when to stop    |
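A target like "under 100ms p99" only helps if the same statistic is computed from real samples. The sketch below is one way to do that with the standard library; `measure_latencies_ms`, `p99`, and `meets_target` are hypothetical helper names, and the operation being timed is assumed to be safe to call repeatedly.

```python
import statistics
import time


def measure_latencies_ms(operation, runs=200):
    """Call `operation` repeatedly; return each call's wall time in milliseconds."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        operation()
        samples.append((time.perf_counter() - start) * 1000.0)
    return samples


def p99(samples):
    """99th percentile: the last of the 99 cut points that split the data into 100 groups."""
    return statistics.quantiles(samples, n=100)[98]


def meets_target(samples, target_ms):
    """True when the measured p99 is at or under the target."""
    return p99(samples) <= target_ms
```

Collect samples under the real workload, then compare against the target before and after every change. If `meets_target` is already true, there is nothing to optimize.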
| What's slow | Tool |
| -------------------------- | -------------------------------------------------------------- |
| CPU-bound Python | py-spy, cProfile + snakeviz |
| CPU-bound JVM | async-profiler, JFR |
| CPU-bound native | perf, Instruments, vtune |
| Memory pressure / GC | Heap profiler (tracemalloc, jmap, heaptrack) |
| I/O-bound (DB, network) | Query logs, EXPLAIN ANALYZE, trace spans |
| Unclear | Flame graph first — it'll tell you which category |
Profile the real workload, not a toy. Micro-benchmarks lie.
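For CPU-bound Python, a minimal way to get a first profile is `cProfile` plus `pstats` from the standard library. The sketch below assumes the workload can be invoked as a single function call; `profile_call` and `slow_sum_of_squares` are hypothetical names, and the latter only stands in for the real workload.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, top=10, **kwargs):
    """Run func under cProfile; return (result, report sorted by cumulative time)."""
    profiler = cProfile.Profile()
    profiler.enable()
    try:
        result = func(*args, **kwargs)
    finally:
        profiler.disable()
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(top)
    return result, buffer.getvalue()


def slow_sum_of_squares(n):
    # Hypothetical stand-in for the real workload.
    return sum(i * i for i in range(n))


if __name__ == "__main__":
    _, report = profile_call(slow_sum_of_squares, 200_000)
    print(report)
```

Sorting by cumulative time surfaces the call paths that own the most runtime, which is usually the right first view. A sampling profiler such as py-spy adds less overhead and can attach to an already running process.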
Optimizations, ranked by typical payoff-to-effort:
| Lever | When it applies | Typical speedup | Effort |
| ---------------------------- | --------------------------------------------------- | --------------- | ------ |
| Do less work | You're computing things nobody uses | 10–100× | Low |
| Fix the algorithm | O(n²) where O(n log n) exists; nested loops over the same collection | 10–1000× | Medium |
| Cache / memoize | Same expensive call, same inputs, repeatedly | 2–100× | Low |
| Batch | N round-trips to a service → 1 round-trip | N× | Medium |
| Move out of the loop | Invariant computation inside a loop | iterations× | Trivial|
| Use the right data structure | list where you need set lookup; linear scan where you need index | 2–1000× | Low |
| Parallelize | Embarrassingly parallel work on a multi-core box | cores× | High |
| Go native / use SIMD | Tight numeric loop in an interpreted language | 10–100× | High |
| Micro-optimize | Unroll, inline, avoid allocations | 1.1–2× | High |
Start at the top. Micro-optimization is the last resort, not the first instinct.
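Two of the cheaper levers from the table, sketched under simple assumptions: swapping a list for a set when the code only needs membership tests, and memoizing a pure function with `functools.lru_cache`. All function names here are hypothetical, and `expensive_lookup` merely counts its calls to make the caching visible.

```python
from functools import lru_cache


def find_known_slow(candidates, known):
    # `x in known` scans the list every time: O(len(candidates) * len(known)).
    return [x for x in candidates if x in known]


def find_known_fast(candidates, known):
    # Build the set once, outside the loop; each lookup is then O(1) on average.
    known_set = set(known)
    return [x for x in candidates if x in known_set]


CALLS = {"count": 0}


@lru_cache(maxsize=None)
def expensive_lookup(key):
    # Stand-in for a costly pure computation; the counter shows how often it really runs.
    CALLS["count"] += 1
    return key.upper()
```

Memoization is only safe when the function's result depends on nothing but its arguments. An unbounded cache (`maxsize=None`) also trades memory for time, so a bound may be the better default on long-running services.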
Benchmark before → one change → benchmark after → record the delta. Every time. If you make three changes and it's faster, you don't know which one did it — and one of them probably made it slower.
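The before/after loop can be scripted so the delta is recorded the same way each time. This is a minimal sketch using `timeit`; `best_time`, `compare`, `loop_sum`, and `builtin_sum` are hypothetical names, and the two sum functions only stand in for the old and new versions of the code under test.

```python
import timeit


def best_time(func, *args, repeat=5, number=10):
    """Best per-call time in seconds over `repeat` rounds of `number` calls."""
    timings = timeit.repeat(lambda: func(*args), repeat=repeat, number=number)
    return min(timings) / number


def compare(before, after, *args, repeat=5, number=10):
    """Benchmark two implementations on the same input and report the delta."""
    if before(*args) != after(*args):
        raise ValueError("outputs differ; fix behavior before measuring speed")
    t_before = best_time(before, *args, repeat=repeat, number=number)
    t_after = best_time(after, *args, repeat=repeat, number=number)
    return {"before_s": t_before, "after_s": t_after, "speedup": t_before / t_after}


def loop_sum(values):
    # "Before": explicit Python loop.
    total = 0
    for v in values:
        total += v
    return total


def builtin_sum(values):
    # "After": the single change being measured.
    return sum(values)


if __name__ == "__main__":
    print(compare(loop_sum, builtin_sum, list(range(100_000))))
```

Taking the minimum over several rounds reduces noise from other processes. The equality check up front refuses to report a speedup for code that no longer produces the same answer.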
Complaint: "Exporting the report takes 40 seconds."
Profile (py-spy top):

    84%  _lookup_user_name  (report.py:67)
    11%  _format_row        (report.py:80)
     3%  csv.writer.writerow
84% in one function. Look at it:

    # db and writer are defined elsewhere (not shown).
    def _lookup_user_name(user_id):
        # One database round-trip per call.
        return db.query("SELECT name FROM users WHERE id = ?", user_id).one()

    def export(rows):
        for row in rows:  # 10,000 rows
            row.user_name = _lookup_user_name(row.user_id)
            writer.writerow(_format_row(row))
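Diagnosis: N+1 query. 10,000 rows → 10,000 round-trips. Lever: batch.

    def export(rows):
        # Collect the distinct ids, then fetch every name in one round-trip.
        user_ids = {row.user_id for row in rows}
        # Binding a list to IN is driver-specific; adapt the placeholder to your DB layer.
        names = dict(db.query("SELECT id, name FROM users WHERE id IN ?", list(user_ids)))
        for row in rows:
            row.user_name = names[row.user_id]
            writer.writerow(_format_row(row))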
Measure: 40s → 0.6s. 67× speedup, one query instead of 10,000. No data structure changed, no parallelism, no C extension. Just: batch the round-trips.
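The export code above depends on a `db` helper that is not shown, so here is a self-contained sketch of the same N+1 versus batched shape using the standard-library `sqlite3` module. The table layout and the function names are assumptions for illustration. The IN list is built from one `?` per id and chunked, because SQLite caps the number of bound parameters per statement.

```python
import sqlite3


def make_db(n_users):
    """In-memory database with a users table, for demonstration only."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
    conn.executemany(
        "INSERT INTO users (id, name) VALUES (?, ?)",
        [(i, f"user{i}") for i in range(n_users)],
    )
    return conn


def names_one_by_one(conn, user_ids):
    """N+1 shape: one query per id."""
    names = {}
    for uid in user_ids:
        row = conn.execute("SELECT name FROM users WHERE id = ?", (uid,)).fetchone()
        names[uid] = row[0]
    return names


def names_batched(conn, user_ids, chunk_size=500):
    """Batched shape: one IN query per chunk of distinct ids."""
    unique_ids = sorted(set(user_ids))
    names = {}
    for start in range(0, len(unique_ids), chunk_size):
        chunk = unique_ids[start:start + chunk_size]
        placeholders = ",".join("?" * len(chunk))
        query = f"SELECT id, name FROM users WHERE id IN ({placeholders})"
        # Each fetched row is an (id, name) pair, which dict.update accepts directly.
        names.update(conn.execute(query, chunk).fetchall())
    return names
```

Only the placeholder string is interpolated into the SQL; the ids themselves are still bound as parameters. In-process SQLite has no network hop, so the gap between the two functions here will be far smaller than in the export example.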
→ behavior-preservation-checker — the IN query with a set dedupes user_ids; make sure that's equivalent (it is — we're populating a dict, dupes were redundant anyway).
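The profile above puts _format_row at 11% of runtime. Amdahl's law turns that share into a ceiling on what optimizing it can buy. The sketch below is just that formula; `overall_speedup` is a hypothetical helper name.

```python
def overall_speedup(fraction, local_speedup):
    """Amdahl's law: total speedup when `fraction` of runtime becomes `local_speedup` times faster."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    if local_speedup <= 0:
        raise ValueError("local_speedup must be positive")
    return 1.0 / ((1.0 - fraction) + fraction / local_speedup)


if __name__ == "__main__":
    # 11% of runtime made 2x faster.
    print(round(overall_speedup(0.11, 2), 2))  # 1.06
    # Ceiling: 11% of runtime made infinitely fast.
    print(round(overall_speedup(0.11, float("inf")), 2))  # 1.12
```

Even an infinitely fast _format_row would improve the export by only about 12%, which is why the profiler's top entry is the place to start.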
Optimizing what the profiler didn't flag: you make _format_row 2× faster. It was 11% of runtime. Total speedup: 1.06×. The profiler told you to look at _lookup_user_name.

For CPU-bound Python, parallelize with multiprocessing, not threading.

    ## Baseline
    <metric> = <value> (measured with: <tool/command>)
    ## Bottleneck
    <file>:<line> — <N>% of runtime
    <why it's slow — the diagnosis>
    ## Change
    <lever from table> — <one-sentence what>
    <diff>
    ## Result
    <metric> = <value> (<N>× speedup)
    ## Behavior check
    <→ behavior-preservation-checker, or: tests green>