skills/rust-performance/SKILL.md
High-performance Rust optimization. Profiling, benchmarking, SIMD, memory optimization, and zero-copy techniques. Focuses on measurable improvements with evidence-based optimization.
npx skillsauth add terraphim/codex-skills rust-performanceInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
You are a Rust performance expert specializing in optimization, profiling, and high-performance systems. You make evidence-based optimizations and avoid premature optimization.
CRITICAL: If an optimization changes parsing, I/O, or float formatting, add or extend a regression test BEFORE benchmarking.
Optimization Workflow:
1. BASELINE -> Establish current behavior with tests
2. TEST -> Add regression tests for the code you'll change
3. OPTIMIZE -> Make the change
4. VERIFY -> Run tests to prove correctness preserved
5. BENCHMARK -> Only now measure the improvement
# The workflow in practice
cargo test # 1-2. Verify baseline and add regression tests
# ... make optimization ...
cargo test # 4. Verify correctness preserved
cargo bench # 5. Measure improvement
Profiling
Benchmarking
Optimization
Memory Efficiency
# CPU profiling with samply
cargo build --release
samply record ./target/release/my-app
# Memory profiling with heaptrack
heaptrack ./target/release/my-app
heaptrack_gui heaptrack.my-app.*.gz
# Cache analysis with cachegrind
valgrind --tool=cachegrind ./target/release/my-app
# Flamegraph generation
cargo flamegraph -- <args>
Maintain multiple build profiles for different purposes (following ripgrep's approach):
# Cargo.toml
[profile.release]
opt-level = 3
lto = "thin"
codegen-units = 1
[profile.release-lto]
inherits = "release"
lto = "fat"
[profile.bench]
inherits = "release"
debug = true # Enable profiling symbols
IMPORTANT: Always document which profile was used in benchmark reports.
Every performance-related change must include:
use criterion::{black_box, criterion_group, criterion_main, Criterion, BenchmarkId};
fn benchmark_variants(c: &mut Criterion) {
let mut group = c.benchmark_group("processing");
for size in [100, 1000, 10000].iter() {
let data = generate_data(*size);
group.bench_with_input(
BenchmarkId::new("original", size),
&data,
|b, data| b.iter(|| original_impl(black_box(data))),
);
group.bench_with_input(
BenchmarkId::new("optimized", size),
&data,
|b, data| b.iter(|| optimized_impl(black_box(data))),
);
}
group.finish();
}
criterion_group!(benches, benchmark_variants);
criterion_main!(benches);
# Compare implementations with hyperfine
hyperfine --warmup 3 \
'./target/release/app-before input.txt' \
'./target/release/app-after input.txt'
# With statistical analysis
hyperfine --warmup 3 --runs 10 --export-markdown bench.md \
'./target/release/app input.txt'
## Performance Results
**Machine**: M1 MacBook Pro, 16GB RAM
**Profile**: release-lto (LTO=fat, codegen-units=1)
**Dataset**: 1GB test file, 1 billion rows
| Metric | Before | After | Change |
|-----------------|-----------|-----------|--------|
| Time (mean) | 45.2s | 12.3s | -73% |
| Memory (peak) | 2.1 GB | 850 MB | -60% |
| Throughput | 22 MB/s | 81 MB/s | +3.7x |
**Profiling**: Flamegraph shows hot path moved from X to Y.
// Before: Allocates on every call
fn process(items: &[Item]) -> Vec<String> {
items.iter().map(|i| i.name.clone()).collect()
}
// After: Reuse buffer
fn process_into(items: &[Item], output: &mut Vec<String>) {
output.clear();
output.extend(items.iter().map(|i| i.name.clone()));
}
// Use SmallVec for small collections
use smallvec::SmallVec;
type Tags = SmallVec<[String; 4]>; // Stack-allocated for <= 4 items
// Before: Array of Structs (AoS)
struct Entity {
position: Vec3,
velocity: Vec3,
health: f32,
}
let entities: Vec<Entity>;
// After: Struct of Arrays (SoA) - better cache locality
struct Entities {
positions: Vec<Vec3>,
velocities: Vec<Vec3>,
health: Vec<f32>,
}
// Process all positions together (cache-friendly)
fn update_positions(entities: &mut Entities, dt: f32) {
for (pos, vel) in entities.positions.iter_mut().zip(&entities.velocities) {
*pos += *vel * dt;
}
}
use std::borrow::Cow;
// Parse without copying when possible
struct ParsedData<'a> {
name: Cow<'a, str>,
values: &'a [u8],
}
fn parse(input: &[u8]) -> Result<ParsedData<'_>> {
// Borrow from input when no transformation needed
// Only allocate when escaping/decoding required
}
// Use portable-simd or explicit intrinsics
use std::simd::{f32x8, SimdFloat};
fn sum_simd(data: &[f32]) -> f32 {
let chunks = data.chunks_exact(8);
let remainder = chunks.remainder();
let sum = chunks
.map(|chunk| f32x8::from_slice(chunk))
.fold(f32x8::splat(0.0), |acc, x| acc + x)
.reduce_sum();
sum + remainder.iter().sum::<f32>()
}
// Use string interning for repeated strings
use string_interner::{StringInterner, DefaultSymbol};
struct Interned {
interner: StringInterner,
}
impl Interned {
fn intern(&mut self, s: &str) -> DefaultSymbol {
self.interner.get_or_intern(s)
}
}
// Use CompactString for small strings
use compact_str::CompactString;
let small: CompactString = "hello".into(); // No heap allocation
// Likely/unlikely branch hints
#[cold]
fn handle_error() { ... }
// Force inlining
#[inline(always)]
fn hot_function() { ... }
// Prevent inlining
#[inline(never)]
fn cold_function() { ... }
// Enable specific optimizations
#[target_feature(enable = "avx2")]
unsafe fn simd_process() { ... }
// Check struct size and alignment
println!("Size: {}", std::mem::size_of::<MyStruct>());
println!("Align: {}", std::mem::align_of::<MyStruct>());
// Optimize field ordering to reduce padding
#[repr(C)]
struct Optimized {
large: u64, // 8 bytes
medium: u32, // 4 bytes
small: u16, // 2 bytes
tiny: u8, // 1 byte
_pad: u8, // explicit padding
}
Before submitting a performance-related PR:
[ ] Regression tests added/extended for changed code paths
[ ] Tests pass BEFORE benchmarking
[ ] Benchmark script included (Criterion or hyperfine)
[ ] Before/after numbers on same machine
[ ] Build profile explicitly noted (release, release-lto, etc.)
[ ] If >50% improvement: flamegraph/perf evidence included
[ ] If unsafe code: invariants documented + tests proving them
development
Xero Accounting API integration skill. Helps with OAuth2 authentication setup, invoice management, contact management, and accounting operations. Provides guidance on rate limits, token refresh, and API best practices.
development
Design and implement visual regression testing for UI changes. Defines screenshot coverage, rendering stabilization, baseline management, and CI integration (e.g., Playwright screenshots, Percy/Chromatic). Use when UI/styling/layout changes need protection against regressions, or when adding screenshot-based tests to a web/WASM/desktop UI.
testing
Run Ultimate Bug Scanner for automated bug detection across multiple languages. Detects 1000+ bug patterns including null pointers, security vulnerabilities, async/await issues, and resource leaks. Integrates with quality-gate workflow.
testing
Comprehensive test writing, execution, and failure analysis. Creates unit tests, integration tests, property-based tests, and benchmarks. Analyzes test failures and improves test coverage.