skills/implement-math/SKILL.md
Translate math (formulas, estimators, algorithms) into code so the implementation faithfully matches what the source actually specifies. Use when writing code from a formula, reviewing an LLM-generated implementation of a formula, debugging a numerical mismatch with a paper, designing a new metric/estimator, or refactoring an existing math-heavy computation. Especially load-bearing whenever aggregation operators (sums, means, expectations, products, geometric means) appear over indices that can be reordered, or whenever the same English label can refer to multiple non-equivalent estimators (e.g. ratio-of-means vs mean-of-ratios, micro-average vs macro-average, sample-weighted vs unweighted). Prevents the failure mode where a code path silently implements the wrong estimator under the same name as the intended one.
Core rule: Compact mathematical notation hides aggregation order. Your job is to make it visible — first in writing, then in code, then in tests.
Mathematical formulas in papers and proposals are nearly always under-specified about operator order. The same expression can map to multiple inequivalent code paths depending on where each summation, mean, division, or expectation actually lives in the loop. The bugs this introduces are silent: numbers come out, plots look reasonable, only careful comparison against ground truth reveals the drift.
A real, ~6-week-old bug from this codebase:
The paper's primary metric was specified as D = C × a_n (per-byte). The implementation computed a_n_per_byte two different ways inside one function:
- mean_σ(bits_σ[k] / bytes_σ[k]): mean-of-ratios (used internally for E_rate)
- mean_σ(bits_σ[k]) / mean_σ(bytes_σ[k]): ratio-of-means (used for the headline a_n_per_byte and the published D)
Both were stored under the same dict key a_k_curve_per_byte. Both are reasonable estimators. Only one is what the paper described. The bug was discovered, a separate diagnostic script (scripts/analyze_per_byte.py) was written to compute the correct estimator from per-permutation logs, and the canonical core.py was never patched. Six weeks later, every analysis script reading metrics["a_k_curve_per_byte"] got the wrong number, and the only flag was a docstring on a downstream helper.
This skill is the discipline that would have caught it on the day the formula was first translated into code.
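To make the divergence concrete, here is a minimal sketch of the two estimators on made-up per-permutation values. The function names and the numbers are illustrative assumptions, not the code from core.py.

```python
from statistics import mean

def a_n_per_byte_mean_of_ratios(bits, byte_counts):
    """Average the per-permutation bits/byte ratios."""
    return mean(b / n for b, n in zip(bits, byte_counts))

def a_n_per_byte_ratio_of_means(bits, byte_counts):
    """Divide the mean bits by the mean byte count."""
    return mean(bits) / mean(byte_counts)

# Hypothetical last-position values for three permutations.
bits = [120.0, 80.0, 40.0]
byte_counts = [30, 40, 10]

mor = a_n_per_byte_mean_of_ratios(bits, byte_counts)  # mean([4.0, 2.0, 4.0])
rom = a_n_per_byte_ratio_of_means(bits, byte_counts)  # 80.0 / (80 / 3)
print(f"mean-of-ratios = {mor:.3f}, ratio-of-means = {rom:.3f}")
```

Both functions take the same inputs and both could plausibly be called "per-byte", yet they return different numbers as soon as byte counts vary across permutations.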
Trigger this skill any time math meets code:
Especially trigger it when:
Do NOT skip this skill because "the formula is simple." Simple formulas hide order ambiguity especially well.
Before writing code, expand the formula until aggregation order is visible. Compact notation is the enemy.
Bad (the source paper's notation):
D = C × a_n (per-byte)
Good (what you actually have to implement):
a_n_pb := (1 / |Σ|) · Σ_{σ ∈ Σ} ( -log_2 P_θ(r_{σ(n)} | r_{σ(<n)}, p) / ‖r_{σ(n)}‖ )
where:
- (1 / |Σ|) · Σ_{σ ∈ Σ} is the outer mean over orderings
- -log_2 P_θ(r_{σ(n)} | r_{σ(<n)}, p) is the inner per-permutation surprise (total bits)
- ‖r_{σ(n)}‖ is the per-permutation byte denominator (specific to this perm's slot-n response)
D_per_byte := C × a_n_pb
This is verbose. That's the point. The verbosity is what carries the information that compact notation discards.
If the source's notation is genuinely ambiguous about aggregation order — common in papers where the math is written compactly to fit a column — the paper has a bug, not just the code. Decide explicitly which order you are implementing, document it inline, and (if applicable) flag the ambiguity to the user.
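Written side by side, the two readings of a compact "per-byte" label look roughly like this. The shorthand symbols b_σ and ℓ_σ are introduced here only for illustration and do not come from the source paper.

```latex
\begin{align*}
b_\sigma &:= -\log_2 P_\theta\!\left(r_{\sigma(n)} \mid r_{\sigma(<n)},\, p\right),
\qquad
\ell_\sigma := \lVert r_{\sigma(n)} \rVert \\[4pt]
a_n^{\mathrm{MoR}} &= \frac{1}{|\Sigma|} \sum_{\sigma \in \Sigma} \frac{b_\sigma}{\ell_\sigma} \\[4pt]
a_n^{\mathrm{RoM}} &= \frac{\frac{1}{|\Sigma|} \sum_{\sigma \in \Sigma} b_\sigma}
                          {\frac{1}{|\Sigma|} \sum_{\sigma \in \Sigma} \ell_\sigma}
                    = \frac{\sum_{\sigma \in \Sigma} b_\sigma}{\sum_{\sigma \in \Sigma} \ell_\sigma}
\end{align*}
```

The two expressions coincide when ℓ_σ is the same for every ordering σ, which is why equal-length toy inputs cannot tell them apart.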
When two estimators of the same quantity exist or could plausibly exist in the same codebase, give them distinct names that encode the aggregation choice. Never reuse one name for two non-equivalent estimators.
Bad: a_n_per_byte used for two different things.
Good: a_n_pb_MoR and a_n_pb_RoM, with the dispreferred one only ever appearing if explicitly requested.
This pattern recurs throughout statistics:
- accuracy_micro vs accuracy_macro
- f1_micro vs f1_macro vs f1_weighted
- ppl_per_token vs ppl_per_byte vs ppl_per_word
- loss_sum vs loss_mean vs loss_per_example
- D_C_an vs D_C_E (two different formulas for "diversity" in this codebase; see CLAUDE.md)
If you find a single name being used for two different aggregations, rename before fixing. The rename is the fix; the aggregation correction is downstream of having a name to attach the corrected behavior to.
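As a sketch of what name-encoded estimators can look like, here is the accuracy pair from the list above. The definition of macro accuracy used here (unweighted mean of per-class accuracy, grouped by true class) is an assumption; other codebases may define it differently, which is one more reason to spell the choice out in the name and docstring.

```python
from collections import defaultdict

def accuracy_micro(y_true, y_pred):
    """Fraction of all examples predicted correctly (every example weighs the same)."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def accuracy_macro(y_true, y_pred):
    """Unweighted mean of per-class accuracy (every class weighs the same)."""
    hits, totals = defaultdict(int), defaultdict(int)
    for t, p in zip(y_true, y_pred):
        totals[t] += 1
        hits[t] += int(t == p)
    return sum(hits[c] / totals[c] for c in totals) / len(totals)

# Imbalanced toy labels: 8 examples of class "a", 2 of class "b".
y_true = ["a"] * 8 + ["b"] * 2
y_pred = ["a"] * 8 + ["a", "b"]

print(accuracy_micro(y_true, y_pred))  # 9 correct out of 10
print(accuracy_macro(y_true, y_pred))  # mean of 8/8 and 1/2
```

On balanced classes the two functions would agree, so the imbalanced input is what makes the naming distinction visible.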
Property tests must use cases where the candidate estimators give numerically different answers. Toy examples with equal-length / equal-weight inputs are useless — they pass for any reasonable aggregation order, so they don't certify which one you implemented.
Construct a case where the candidates must disagree:
import pytest

# compute_variants is the codebase's function under test (its import path is not shown here).

def test_mor_diverges_from_rom_when_anticorrelated():
    """Where bits and bytes vary inversely across perms, MoR ≠ RoM."""
    rec = {
        "a_k_curve": [55.0, 55.0],        # mean of [100, 10]
        "a_k_byte_counts": [55, 55],      # mean of [10, 100]
        "coherence_C": 1.0,
        "per_permutation_a_k_curves": [
            [100.0, 100.0],   # 100 bits paired with 10 bytes → 10/byte
            [10.0, 10.0],     # 10 bits paired with 100 bytes → 0.1/byte
        ],
        "per_permutation_byte_counts": [
            [10, 10],
            [100, 100],
        ],
    }
    v = compute_variants(rec)
    # MoR = mean([100/10, 10/100]) = mean([10.0, 0.1]) = 5.05
    assert v.a_n_pb_MoR == pytest.approx(5.05)
    # RoM = a_k_curve[-1] / a_k_byte_counts[-1] = 55/55 = 1.0
    assert v.a_n_pb_RoM == pytest.approx(1.0)
Generic version: for any aggregation, find an input where the two candidate orders give different answers, and assert the one you intended.
If the test passes for both candidates, the test is useless — strengthen it.
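One way to enforce this mechanically is a small guard that refuses test inputs on which the candidates agree. The helper name assert_discriminates and the loss functions below are hypothetical, a sketch of the idea rather than an existing utility.

```python
import math

def assert_discriminates(candidates, *args, rel_tol=1e-9):
    """Fail if any two candidate estimators agree on the given input.

    `candidates` maps a label to a callable; each is called with *args.
    Returns the computed values so the caller can assert the intended one.
    """
    values = {name: fn(*args) for name, fn in candidates.items()}
    names = list(values)
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            if math.isclose(values[first], values[second], rel_tol=rel_tol):
                raise AssertionError(
                    f"{first} and {second} agree on this input "
                    f"({values[first]!r}); it cannot tell them apart"
                )
    return values

def loss_sum(losses):
    return sum(losses)

def loss_mean(losses):
    return sum(losses) / len(losses)

candidates = {"loss_sum": loss_sum, "loss_mean": loss_mean}
values = assert_discriminates(candidates, [0.5, 1.5, 4.0])
assert values["loss_mean"] == 2.0  # the estimator we intended
```

A single-element input such as [3.0] would make the guard raise, because the sum and the mean coincide there and the test would certify nothing.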
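Each function that implements a non-trivial math formula gets a docstring with two parts: the formula in expanded form, and a correspondence that maps each operator in the formula to the code that performs it.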
def compute_a_n_per_byte_MoR(perm_curves, perm_bytes):
    """Mean-of-ratios per-byte progressive surprise at the last position.

    Formula:
        a_n_pb_MoR = (1/|Σ|) · Σ_σ (a_{σ,n}^bits / ‖r_{σ(n)}‖)

    Code correspondence:
        - Σ_σ ... (sum over orderings): ``sum(... for cur, b in zip(...))`` (line below)
        - 1/|Σ| (outer mean): ``/ P``
        - per-permutation ratio: ``cur[-1] / b[-1]`` inside the sum
    """
    P = len(perm_curves)
    return sum(cur[-1] / b[-1] for cur, b in zip(perm_curves, perm_bytes)) / P
The line-by-line correspondence is the part that catches reorderings. If you can't write it cleanly, the implementation isn't faithful to the formula — refactor until the correspondence is one-to-one.
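For comparison, a ratio-of-means counterpart might be documented the same way. This is a sketch under the assumption that the inputs have the same shape as in the mean-of-ratios function above; the function name is hypothetical.

```python
def compute_a_n_per_byte_RoM(perm_curves, perm_bytes):
    """Ratio-of-means per-byte progressive surprise at the last position.

    Formula:
        a_n_pb_RoM = ( (1/|Σ|) · Σ_σ a_{σ,n}^bits ) / ( (1/|Σ|) · Σ_σ ‖r_{σ(n)}‖ )

    Code correspondence:
        - numerator mean over orderings:   ``sum(cur[-1] for cur in perm_curves) / P``
        - denominator mean over orderings: ``sum(b[-1] for b in perm_bytes) / P``
        - single outer division:           ``mean_bits / mean_bytes``
    """
    P = len(perm_curves)
    mean_bits = sum(cur[-1] for cur in perm_curves) / P
    mean_bytes = sum(b[-1] for b in perm_bytes) / P
    return mean_bits / mean_bytes

# Same per-permutation inputs as the divergence test in this document.
perm_curves = [[100.0, 100.0], [10.0, 10.0]]
perm_bytes = [[10, 10], [100, 100]]
print(compute_a_n_per_byte_RoM(perm_curves, perm_bytes))  # 55.0 / 55.0
```

Here the division appears once, outside both sums, and the correspondence section says so explicitly. A reviewer comparing the two docstrings can see at a glance where the estimators part ways.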
This pattern is the math version of import-content's "the document references the script's output by name" rule: the formula has one source of truth (the docstring), and the code's job is to be a transparent translation of it.
Before merging or claiming "done," ask explicitly — to yourself, to the LLM, or to a human reviewer:
"For each summation / expectation / product / mean in the formula, point to the line of code that performs it, and confirm the order of operations matches the formula."
Do this in the diff. If the answer is "the order doesn't matter here" or "it does both at the same time," that's a smell — verify by constructing an input where order DOES matter and confirming the code's output agrees with your intended formula on that input.
When prompting an LLM to implement a formula, append: "In your implementation, point to the line of code that performs each aggregation operator in the formula, in order. If you cannot, the implementation is not faithful — say so and ask for clarification before writing code."
When you find a discrepancy between a formula and a downstream computation, fix the canonical implementation, not just the analysis script that hit the bug.
If the canonical fix would invalidate saved data that's expensive to regenerate, at minimum:
A "skilled fix" that lives only in analyze_per_byte.py and not in core.py will be silently bypassed by the next code path that reads from core.py's output — and there will always be a next code path. The fix has to be discoverable from either side.
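One possible way to make a fix discoverable from the consuming side is to have the metrics container reject the ambiguous key and name its replacements. The class name NamedMetrics and the suffixed key names are assumptions made for illustration; nothing here is taken from core.py.

```python
class NamedMetrics(dict):
    """Dict of metrics that refuses ambiguous legacy keys.

    RENAMED maps a retired key to the explicit names that replace it.
    """
    RENAMED = {
        "a_k_curve_per_byte": ("a_k_curve_per_byte_MoR", "a_k_curve_per_byte_RoM"),
    }

    def __getitem__(self, key):
        if key in self.RENAMED:
            options = " or ".join(self.RENAMED[key])
            raise KeyError(f"{key!r} was ambiguous; read {options} instead")
        return super().__getitem__(key)

# Hypothetical values, reusing the numbers from the divergence test.
metrics = NamedMetrics(
    a_k_curve_per_byte_MoR=[5.05, 5.05],
    a_k_curve_per_byte_RoM=[1.0, 1.0],
)
print(metrics["a_k_curve_per_byte_MoR"])
```

Any script that still reads metrics["a_k_curve_per_byte"] now fails loudly with a pointer to the two explicit names. Note that dict.get() does not go through __getitem__ on a dict subclass, so a fuller version would need to cover that path as well.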
Before writing a new implementation of a formula, grep for existing implementations of the same quantity. There almost always is one. Two outcomes:
If you're translating math into code and you can't disambiguate aggregation order from the source material, stop and ask the user before writing the code.
"What aggregation order do you intend for X: mean-of-ratios across permutations, or ratio-of-means? They differ whenever the per-permutation bits-per-byte rate is correlated with the byte count."
The user can answer in five seconds. Debugging the resulting silent bug takes five weeks. The cost of one clarifying question is always less than the cost of one wrong implementation that ships.
- Population variance (/n) vs sample variance (/(n-1)): np.std(ddof=0) vs np.std(ddof=1)
- range(n) vs range(n+1) (off-by-one)
- axis=0 vs axis=-1 (a one-character difference with very different semantics)
For any of these: name encodes the choice, test distinguishes the candidates, docstring documents the formula, line-by-line correspondence to the code.
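The variance case can be sketched with the standard library alone. The wrapper names are hypothetical; statistics.pvariance and statistics.variance are the stdlib functions for the /n and /(n-1) forms respectively.

```python
from statistics import pvariance, variance

def variance_population(xs):
    """Divide the summed squared deviations by n."""
    return pvariance(xs)

def variance_sample(xs):
    """Divide the summed squared deviations by n - 1 (Bessel's correction)."""
    return variance(xs)

xs = [1.0, 2.0, 3.0, 4.0]
print(variance_population(xs))  # squared deviations sum to 5.0, divided by 4
print(variance_sample(xs))      # same sum, divided by 3
```

With only four points the two results differ by a third, so this input is enough to tell which divisor a given implementation uses.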