skills/metal-shader-expert/SKILL.md
20 years Weta/Pixar experience in real-time graphics, Metal shaders, and visual effects. Expert in MSL shaders, PBR rendering, tile-based deferred rendering (TBDR), and GPU debugging. Activate on 'Metal shader', 'MSL', 'compute shader', 'vertex shader', 'fragment shader', 'PBR', 'ray tracing', 'tile shader', 'GPU profiling', 'Apple GPU'. NOT for WebGL/GLSL (different architecture), general OpenGL (deprecated on Apple), CUDA (NVIDIA only), or CPU-side rendering optimization.
20+ years Weta/Pixar experience specializing in Metal shaders, real-time rendering, and creative visual effects. Expert in Apple's Tile-Based Deferred Rendering (TBDR) architecture.
Massive parallel data processing:
Memory access patterns:
[[color(n)]]
threadgroup memory
Performance characteristics needed:
Input: Variable type needed
├── Position/depth calculations?
│ └── YES: Use `float` (32-bit precision required)
├── Color/normal calculations?
│ ├── HDR/wide gamut? → `float`
│ └── Standard range? → `half` (saves 50% registers)
├── Iteration counters/indices?
│ └── Use `uint16_t` or `ushort` when possible
└── Temporary calculations?
  ├── Intermediate precision needed? → `float`
  └── Display-bound result? → `half`
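The split between `float` for positions and `half` for standard-range colors can be checked with a little arithmetic on the CPU. The sketch below is plain C++, not MSL. `quantize_to_half_precision` and `half_rounding_error` are hypothetical helpers that model only the 11-bit significand of a 16-bit float. They deliberately ignore its limited exponent range (overflow above 65504, subnormals), so treat the numbers as an approximation.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper: round x to 11 significant bits, the significand
// width of a 16-bit float. Assumption: exponent range limits are ignored.
float quantize_to_half_precision(float x) {
    if (x == 0.0f) return x;
    assert(std::isfinite(x));
    int exponent = 0;
    // x == mantissa * 2^exponent with 0.5 <= |mantissa| < 1
    float mantissa = std::frexp(x, &exponent);
    // Keep 11 bits of the mantissa, rounding to nearest
    float rounded = std::nearbyint(std::ldexp(mantissa, 11));
    return std::ldexp(rounded, exponent - 11);
}

// Absolute error introduced by the rounding above
float half_rounding_error(float x) {
    return std::fabs(quantize_to_half_precision(x) - x);
}
```

Under these assumptions a color channel of 0.7 moves by less than 0.0002, well below half of an 8-bit display step (1/510, about 0.002). A coordinate of 4097.3 snaps to 4096, because representable values in that range are 4 apart.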
Render target strategy:
MTLStorageModeMemoryless)
Detection: Frame debugger shows high memory bandwidth, low ALU utilization
Symptoms: Multiple texture fetches per fragment, storing unnecessary render targets
Fix: Use tile shaders for multi-pass effects, memoryless targets for intermediate data
// BAD: Multiple passes with full store/load
float4 pass1_result = sample_texture(tex1, uv);
// Store to render target, then load in next pass
// GOOD: Tile shader keeps data in tile memory
threadgroup float4 tile_data[64];
// Process multiple steps without memory round-trip
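To get a feel for what a store/load round trip costs, the traffic can be estimated on the CPU. This is a back-of-the-envelope C++ sketch, not Metal code. The function names are hypothetical, and it assumes an uncompressed render target that is stored once and loaded once per frame.

```cpp
#include <cassert>
#include <cstdint>

// Bytes moved per frame when one intermediate render target is stored to
// device memory at the end of a pass and loaded again by the next pass.
// Assumption: no framebuffer compression, exactly one store and one load.
std::uint64_t round_trip_bytes_per_frame(std::uint32_t width,
                                         std::uint32_t height,
                                         std::uint32_t bytes_per_pixel) {
    assert(width > 0 && height > 0 && bytes_per_pixel > 0);
    const std::uint64_t target_bytes =
        std::uint64_t(width) * height * bytes_per_pixel;
    return target_bytes * 2;  // one store + one load
}

// The same traffic expressed per second for a given frame rate
std::uint64_t round_trip_bytes_per_second(std::uint32_t width,
                                          std::uint32_t height,
                                          std::uint32_t bytes_per_pixel,
                                          std::uint32_t frames_per_second) {
    return round_trip_bytes_per_frame(width, height, bytes_per_pixel) *
           frames_per_second;
}
```

For a 1920x1080 target at 8 bytes per pixel (four 16-bit channels), this gives 33,177,600 bytes per frame, or roughly 1.99 GB/s at 60 frames per second. That is the traffic a tile-memory or memoryless intermediate would avoid.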
Detection: GPU occupancy drops below 50%, register spilling in shader profiler
Symptoms: Using float4 everywhere, large intermediate arrays
Fix: Use half for display-bound calculations, pack data efficiently
// BAD: Wastes registers
float4 color, normal, tangent, bitangent;
// GOOD: Efficient packing
half4 color; half3 normal; half2 tangent_packed;
Detection: Fragment shader shows low efficiency in GPU profiler
Symptoms: if/else statements based on material properties or uniforms
Fix: Use function constants for compile-time specialization
// BAD: Runtime branching
if (material.has_normal_map) { /* complex normal mapping */ }
// GOOD: Function constant
constant bool has_normal_map [[function_constant(0)]];
if (has_normal_map) { /* branch eliminated at compile time */ }
Detection: Memory bandwidth higher than expected, register usage at 100%
Symptoms: float used for colors, normals, and other display-bound values
Fix: Default to half, upgrade only when precision artifacts appear
// BAD: Doubles bandwidth unnecessarily
float3 lighting_calculation(float3 normal, float3 light_dir, float3 albedo)
// GOOD: Half precision for display-bound calculations
half3 lighting_calculation(half3 normal, half3 light_dir, half3 albedo)
Detection: Ray tracing performance significantly below expectations
Symptoms: Using intersection query API instead of intersector
Fix: Use intersector API with explicit result handling for hardware alignment
Initial novice implementation:
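fragment float4 pbr_fragment(VertexOut in [[stage_in]],
                             constant Material& material [[buffer(0)]],
                             texture2d<float> albedo_tex [[texture(0)]]) {
    // `sampler` is a type name in MSL, so an instance is declared here (settings are an assumption)
    constexpr sampler albedo_sampler(filter::linear, address::repeat);
    float4 albedo = albedo_tex.sample(albedo_sampler, in.uv);
    float3 normal = normalize(in.normal);
    // ... complex BRDF calculation using float everywhere
    return float4(final_color, 1.0);
}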
Expert decision process:
half for most calculations
half for intermediate values
Optimized implementation:
constant bool use_normal_map [[function_constant(0)]];
constant bool use_metallic_roughness [[function_constant(1)]];
fragment half4 pbr_fragment(VertexOut in [[stage_in]],
                            constant MaterialHalf& material [[buffer(0)]],
                            texture2d<half> albedo_tex [[texture(0)]]) {
    // `sampler` is a type name in MSL, so an instance is declared here (settings are an assumption)
    constexpr sampler albedo_sampler(filter::linear, address::repeat);
    half4 albedo = albedo_tex.sample(albedo_sampler, in.uv);
    half3 normal = normalize(half3(in.normal)); // Only convert once
    if (use_normal_map) {
        // Normal mapping branch eliminated at compile time
    }
    // BRDF calculation in half precision
    half3 final_color = calculate_brdf_half(albedo.rgb, normal, material);
    return half4(final_color, albedo.a);
}
Performance impact: 40% reduction in register usage, 2x occupancy increase
Scenario: Blur effect needing neighbor pixel access
Novice approach: Compute shader with texture reads
kernel void blur_compute(texture2d<float, access::read> input [[texture(0)]],
                         texture2d<float, access::write> output [[texture(1)]],
                         uint2 gid [[thread_position_in_grid]]) {
    // Multiple texture reads - expensive on TBDR
    // Border handling omitted: the unsigned offsets wrap when gid.x or gid.y is 0
    float4 result = input.read(gid + uint2(-1, -1)) * 0.0625 +
                    input.read(gid + uint2(0, -1)) * 0.125 /* + ... remaining taps */;
    output.write(result, gid);
}
Expert analysis:
Optimized tile shader:
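kernel void blur_tile(imageblock<half4> img_block,
                      texture2d<half, access::read> input [[texture(0)]],
                      ushort2 tid [[thread_position_in_threadgroup]]) {
    // Load tile data once (element type matches the half4 values written below)
    img_block.write(half4(input.read(calculate_position(tid))), tid);
    // Imageblock writes are synchronized with the imageblock memory flag
    threadgroup_barrier(mem_flags::mem_threadgroup_imageblock);
    // Blur calculation using tile memory (free access)
    half4 result = sample_tile_neighbors(img_block, tid);
    // Wait until every thread has read its neighbors before overwriting them
    threadgroup_barrier(mem_flags::mem_threadgroup_imageblock);
    img_block.write(result, tid);
}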
Result: 60% performance improvement due to eliminated bandwidth
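The bandwidth argument behind this result can be approximated by counting reads per output pixel. The following is a CPU-side C++ sketch with hypothetical function names. It assumes each tile loads its own pixels plus a halo of `kernel_size / 2` pixels on every side exactly once, which is a common tiling scheme and not necessarily what the shader above does.

```cpp
#include <cassert>
#include <cstdint>

// Texture reads per output pixel when every thread fetches its whole
// kernel_size x kernel_size neighborhood from the texture.
constexpr std::uint32_t naive_reads_per_pixel(std::uint32_t kernel_size) {
    return kernel_size * kernel_size;
}

// Texture reads per output pixel when a tile_size x tile_size group loads
// its pixels plus the surrounding halo once and then works from tile memory.
// Assumption: the halo is kernel_size / 2 pixels wide on every side.
double tiled_reads_per_pixel(std::uint32_t tile_size,
                             std::uint32_t kernel_size) {
    assert(tile_size > 0 && kernel_size % 2 == 1);
    const std::uint32_t halo = kernel_size / 2;
    const std::uint32_t loaded_edge = tile_size + 2 * halo;
    return double(loaded_edge) * loaded_edge /
           (double(tile_size) * tile_size);
}
```

For a 3x3 kernel, the naive version issues 9 reads per pixel, while a 16x16 tile with a one-pixel halo needs 324 loads for 256 pixels, about 1.27 reads per pixel. Read counts are only a proxy: the measured speedup depends on caching and on how the tile is actually filled.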
Performance Validation:
Correctness Validation:
Architecture Compliance:
Wrong platforms/APIs:
- webgl-shader-expert - different precision rules, extension handling
- gpu-compute-expert - different memory model, NVIDIA-specific optimizations
- graphics-api-expert - immediate-mode renderer assumptions
Wrong abstraction level:
- performance-engineering - different bottlenecks, memory patterns
- game-engine-expert - render graph design, asset pipelines
- graphics-programming - need Apple-specific TBDR knowledge
Wrong problem scope:
- native-app-designer - Core Animation, simpler shaders sufficient
- scientific-computing - different precision/accuracy requirements
- web-graphics-expert - browser constraints, WebGPU considerations
Master Metal shaders with the precision of film production and the performance demands of real-time interaction.