.claude/skills/review-shaders/SKILL.md
Audit HLSL pixel shaders for GPU performance — math optimizations, ALU reduction, texture efficiency at 120-240fps
Enter planning mode. Deep-audit HLSL pixel shaders in src/shaders/ for GPU performance waste — redundant math, avoidable transcendentals, suboptimal patterns. Use maximum parallelism — spawn explore agents for independent shader groups. Every optimization must be visually identical to the original.
Alt-Tabby renders background shaders at 120-240fps via D3D11 pixel shaders compiled from HLSL to DXBC. These shaders are converted from Shadertoy GLSL, and many carry unoptimized patterns from their original authors or from mechanical GLSL→HLSL conversion. At 240fps on a 1440p display, every pixel shader instruction runs ~885M times/second (2560×1440×240 for a full-screen overlay). Even saving one ALU instruction matters.
Scope: All src/shaders/**/*.hlsl files (including mouse/ and selection/ subdirs). Includes both pixel shader (PSMain) and compute shader (CSMain) logic — mouse shaders may contain both in a single .hlsl file. Does NOT cover:
- d2d_shader.ahk (use review-paint for the D2D pipeline)

Cardinal rule: Every optimization must produce visually identical output. These are aesthetic shaders; if you can't prove the math is equivalent, don't suggest the change. "Close enough" is not acceptable.
Transcendental functions (sin, cos, atan2, exp, log, pow) are among the most expensive single instructions on the GPU. Look for:
- sin/cos of the same angle: Replace with sincos(angle, s, c). HLSL's sincos intrinsic computes both in one operation.
// BEFORE — 2 transcendentals
float s = sin(angle);
float c = cos(angle);
// AFTER — 1 intrinsic
float s, c;
sincos(angle, s, c);
- Repeated sin/cos calls with the same argument: Hoist to a local variable.
// BEFORE — sin(time * 0.1) computed 3 times
x += sin(time * 0.1) * 2.0;
y += sin(time * 0.1) * 3.0;
z += sin(time * 0.1);
// AFTER — computed once
float st = sin(time * 0.1);
x += st * 2.0;
y += st * 3.0;
z += st;
- pow(x, 2.0): Replace with x * x. pow is a transcendental; multiply is ALU.
- pow(x, 0.5): Replace with sqrt(x), which has a dedicated hardware unit.
- pow(x, N) for small integer N: Expand manually (x*x*x for N=3).
- exp(x * log(y)): This is just pow(y, x), but check if the original intent was simpler.
- normalize once, extract length via rsqrt if both are needed.

Many Shadertoy shaders use loops for FBM noise, raymarching, or iterative effects:
- total += noise(p) * amp; p = mul(p, m); amp *= decay; (typical FBM loop body): ensure the mul and noise calls can't be simplified.

- mul(v, M) vs mul(M, v): HLSL matrix multiplication order matters. Ensure the correct convention is used and no unnecessary transpose is happening.
- Rotation matrices computed per-pixel: If the rotation angle is uniform (from cbuffer time), the matrix is constant across all pixels. Compute it once outside the pixel shader (for example, pass the sine and cosine in through the cbuffer), or at minimum build it from a single sincos.
// BEFORE: per-pixel, 4 transcendental calls (wasteful if angle is uniform)
float2x2 rot = float2x2(cos(a), sin(a), -sin(a), cos(a));
// AFTER: single sincos + construction
float s, c;
sincos(a, s, c);
float2x2 rot = float2x2(c, s, -s, c);
Note: HLSL static variables do not persist across pixels or draw calls. A static initialized from a cbuffer value is still evaluated in every shader invocation; only initializers built purely from literals are folded at compile time.
- Constructing float2x2/float3x3 from constants: Use static const so the compiler knows it's constant.
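The matrix guidance above can be sketched as follows. This is a minimal illustration under stated assumptions, not code from Alt-Tabby: the cbuffer layout, the rotSinCos field, the matrix values, and the function names are all hypothetical, and supplying sin/cos from the CPU would need a matching change on the host side.

```hlsl
// Hypothetical cbuffer: field names and layout are assumptions for illustration.
cbuffer ExampleParams : register(b0)
{
    float  time;
    float2 rotSinCos;   // (sin(angle), cos(angle)), assumed to be written by the CPU once per frame
};

// Matrix built purely from literals: static const lets the compiler fold it.
// The values are example numbers of the kind used for FBM octave rotation.
static const float2x2 kOctaveRot = float2x2(0.8, 0.6, -0.6, 0.8);

// Uniform angle: no transcendental work in the shader at all.
float2 RotateUniform(float2 p)
{
    float s = rotSinCos.x;
    float c = rotSinCos.y;
    return mul(p, float2x2(c, s, -s, c));
}

// Angle varies per pixel: one sincos instead of separate sin and cos calls.
float2 RotatePerPixel(float2 p, float a)
{
    float s, c;
    sincos(a, s, c);
    return mul(p, float2x2(c, s, -s, c));
}

// Conversion note: GLSL `m * p` with m = mat2(c, s, -s, c) matches
// HLSL mul(p, M) with M = float2x2(c, s, -s, c). GLSL fills columns and
// HLSL fills rows, so copying the constructor arguments verbatim already
// transposes the matrix; swapping to mul(M, p) would rotate the other way.
float2 ApplyOctaveRot(float2 p)
{
    return mul(p, kOctaveRot);
}
```

When auditing, compare the constructor arguments and the mul argument order against the .glsl original together; changing only one of them flips the rotation direction.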
For shaders with iChannel textures:
- x * 0.5 + 0.5: This is mad(x, 0.5, 0.5). The compiler usually handles this, but explicit mad() is clearer intent.
- 1.0 - (1.0 - x): Simplifies to x.
- a / b * c: If b and c are both constant, use a * (c / b); the constant folds, leaving one multiply instead of divide + multiply. Division is expensive on GPU.
- length(v) * length(v): Use dot(v, v), which avoids the sqrt inside length.
- abs(x) * abs(x): Same as x * x.
- clamp(x, 0.0, 1.0): Use saturate(x), which is free on most GPU hardware (modifier, not instruction).
- max(x, 0.0): Could be saturate(x) only if x is known never to exceed 1.0; otherwise leave as max.
- smoothstep(0.0, 1.0, x): Equivalent to t*t*(3.0 - 2.0*t) with t = saturate(x), but smoothstep is fine; the compiler knows this.

Mechanical GLSL→HLSL conversion can introduce waste:
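- fmod instead of frac: In GLSL, mod(x, 1.0) is idiomatic. The converter produces fmod(x, 1.0), but frac(x) is cheaper (single instruction, no divide). The two agree only for x ≥ 0: fmod truncates toward zero and returns negative results for negative x, while frac(x) = x - floor(x) matches GLSL mod(x, 1.0). Check the sign of x before claiming the swap is output-identical to the current HLSL.
- Redundant .xyz / float4(v3, 1.0) chains.
- (float3)x broadcast: Fine, but check if the original GLSL was doing something more specific.

HLSL compilers are good but not perfect. Help them: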
- static const for values computed from other constants: ensures compile-time evaluation.
- 3.14159265 is fine, but 3.14159265358979323846 wastes parser time with no precision gain in float (only 7 significant digits). Use 3.14159265 or better yet 3.14159265f.
- 2.0 in float context is fine; 2 might cause an implicit cast in some contexts.

- The O(maxParticles) loop inside each grid cell thread is the hottest compute code. Key optimizations: early-exit radius check (if (dist > maxRadius) continue;) before computing glow/color; skip dead particles (if (p.life >= 1.0) continue;) as first check; avoid redundant normalize.
- The pos.xy = RG, vel.xy = BA packing is intentional for buffer reuse; don't suggest cleanup.
- Dispatch size is rounded up (ceil(totalElements / 64)), so the if (idx >= total) return; guard is necessary for non-multiple-of-64 buffer sizes; don't flag it as waste.
- Sizes are runtime values (gridW, gridH, maxParticles), not compile-time #define constants. Buffer allocation comes from _Shader_ComputeBufferLayout() in d2d_shader.ahk.
- reactivity (cbuffer float at offset 124): Reviews should verify new shaders multiply cursor-dependent forces by this value.

All .hlsl files in src/shaders/ and subdirectories (mouse/, selection/). There are 150+ shaders. Organize the audit by pattern, not by individual file; many shaders share the same noise functions, FBM loops, and rotation patterns.
Use the .glsl source as reference when verifying visual equivalence — it shows the original author intent.
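For the compute-shader part of the scope, the check ordering described earlier (bounds guard, dead-particle skip, radius early-exit) might look roughly like the sketch below. Every name, the struct layout, the buffer bindings, and the glow formula are hypothetical stand-ins; the real layout comes from the project's own buffer setup and will differ.

```hlsl
// Hypothetical particle record; the real buffers use their own packing.
struct ExampleParticle
{
    float4 posVel;   // xy = position, zw = velocity (assumed layout)
    float  life;     // assumed: 0 = just spawned, >= 1 = dead
};

StructuredBuffer<ExampleParticle> particles : register(t0);
RWStructuredBuffer<float4>        gridOut   : register(u0);

// Hypothetical cbuffer: runtime sizes rather than #define constants.
cbuffer ExampleComputeParams : register(b0)
{
    uint  gridW;
    uint  gridH;
    uint  maxParticles;
    float maxRadius;
};

[numthreads(64, 1, 1)]
void CSMain(uint3 id : SV_DispatchThreadID)
{
    uint idx   = id.x;
    uint total = gridW * gridH;

    // Required guard: the dispatch is rounded up to a multiple of 64.
    if (idx >= total) return;

    // Center of this thread's grid cell.
    float2 cell = float2((float)(idx % gridW), (float)(idx / gridW)) + 0.5;
    float  glow = 0.0;

    for (uint i = 0; i < maxParticles; i++)
    {
        ExampleParticle p = particles[i];

        // Cheapest test first: skip dead particles.
        if (p.life >= 1.0) continue;

        // Radius early-exit before any glow/color math.
        float dist = length(p.posVel.xy - cell);
        if (dist > maxRadius) continue;

        // Placeholder falloff, only reached for live, nearby particles.
        glow += (1.0 - p.life) * (1.0 - dist / maxRadius);
    }

    gridOut[idx] = float4(glow, glow, glow, 1.0);
}
```

The point of the ordering is that the expensive per-particle work sits behind both rejections; a review should confirm the real shaders follow the same order rather than rewrite them to match this sketch.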
Split by optimization category (run in parallel):
- Search .hlsl files for sin(, cos(, pow(, exp(, atan2(. Find paired sin/cos, repeated calls, pow with integer/simple exponents. Count frequency per file.
- Inspect for loops in .hlsl files. Check for loop-invariant hoisting opportunities, unnecessary iterations, per-pixel matrix construction inside loops.
- Search for fmod(.*1.0), length(.*length(, clamp(.*0.0.*1.0), division by constants, 1.0 - (1.0 -, and other algebraic simplification opportunities.
- Compare .hlsl against .glsl for each shader. Find dead iMouse code, unnecessary type conversions, fmod that should be frac.

- The .glsl file next to each .hlsl is the original Shadertoy source for reference
- query_interface.ps1 <file>: understand the public surface of D3D11 infrastructure files (d2d_shader.ahk, gui_effects.ahk) when tracing how shaders are loaded and composed

Group findings by pattern, not by individual file. Many shaders will share the same issue.
For each pattern:
| Pattern | Affected Shaders | Per-Pixel Cost Saved | Complexity | Fix |
|---------|-----------------|---------------------|------------|-----|
| Paired sin/cos → sincos | fire.hlsl:45, accretion.hlsl:23, +12 more | ~1 transcendental/pixel | One-line per site | sincos(angle, s, c) |
Columns:
Then for each affected file, list the specific locations as a sub-table or bullet list so implementation can be done methodically.
Do not filter. A single saved transcendental ×885M pixels/sec = real watts and real frame time. List everything.
Prove equivalence: For every suggested change, show the mathematical equivalence. sincos(a, s, c) produces exactly s=sin(a), c=cos(a), so it is trivially equivalent. x*x vs pow(x, 2.0) is equivalent for x ≥ 0; for negative x, HLSL documents pow as returning NaN (although the compiler may already fold a literal exponent of 2.0 into a multiply), so confirm the sign of x in context. If equivalence requires assumptions (e.g., "x is non-negative"), state the assumption and verify it holds in context.
Check the GLSL original: If the HLSL already differs from the GLSL in a way that suggests intentional optimization during conversion, don't flag it again.
Don't break visual output: If unsure whether a simplification is visually identical, err on the side of not suggesting it. Mark uncertain cases as "needs visual verification."
Respect author intent: Some "inefficiencies" are intentional artistic choices (e.g., a specific pow curve for color grading). Don't optimize these away.
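As an illustration of how these rules could be applied, the sketch below annotates three rewrites with the assumption each one depends on. The function names and inputs are hypothetical and only serve to show the style of justification; they are not taken from any shader in the repository.

```hlsl
// 1) pow(x, 2.0) -> x * x
// Assumption: x >= 0. Holds here because r comes from length().
float ExampleRingFalloff(float2 uv)
{
    float r = length(uv);            // r >= 0 by construction
    // BEFORE: return pow(r, 2.0);
    return r * r;
}

// 2) fmod(x, 1.0) -> frac(x)
// Assumption: x >= 0. abs() makes the first term non-negative, and
// elapsedSeconds is assumed (hypothetically) to never be negative.
// For negative x the two differ: fmod keeps the sign of x, frac does not.
float ExampleStripePhase(float2 uv, float elapsedSeconds)
{
    float x = abs(uv.x) * 8.0 + elapsedSeconds;
    // BEFORE: return fmod(x, 1.0);
    return frac(x);
}

// 3) normalize(v) plus length(v) -> one rsqrt shared by both
// Assumption: v is non-zero. rsqrt is an approximation, so len may differ
// from length(v) in the last bits: mark as "needs visual verification".
float3 ExampleNormalizeWithLength(float3 v, out float len)
{
    float d      = dot(v, v);
    float invLen = rsqrt(d);
    len = d * invLen;                // |v| = d / sqrt(d)
    return v * invLen;
}
```

The first two rewrites can be reported as proven once the stated assumption is checked at each call site; the third belongs in the uncertain bucket because its result is not guaranteed to be bit-identical.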
Section 1: Transcendental Optimizations:

| Pattern | Affected Shaders | Per-Pixel Cost Saved | Complexity | Fix |
|---------|-----------------|---------------------|------------|-----|

Section 2: Loop Optimizations:

| Pattern | Affected Shaders | Per-Pixel Cost Saved | Complexity | Fix |
|---------|-----------------|---------------------|------------|-----|

Section 3: Algebraic Simplifications:

| Pattern | Affected Shaders | Per-Pixel Cost Saved | Complexity | Fix |
|---------|-----------------|---------------------|------------|-----|

Section 4: Conversion Artifacts:

| Pattern | Affected Shaders | Per-Pixel Cost Saved | Complexity | Fix |
|---------|-----------------|---------------------|------------|-----|

Order within each section by total impact (per-pixel savings × number of affected shaders, highest first).
Ignore any existing plans — create a fresh one.
After implementing any shader changes, you MUST:
1. Run the --live test suite: powershell -File tests/test.ps1 --live. This compiles the exe (which compiles HLSL→DXBC) and validates everything end-to-end.
2. Check git status for changed .bin files in resources/shaders/. HLSL source changes produce new compiled DXBC binaries; these MUST be committed alongside the .hlsl changes.

Shader changes without compiled bins are broken: the compiled exe embeds the .bin files, not the .hlsl source.