Performance Pipeline

Two-phase performance team: triage from multiple angles in parallel, then dispatch the right specialist skill for each finding.

When to Use

- After writing or modifying hot-path code
- When benchmarks show unexpected regressions
- Before merging performance-sensitive changes
- When you need deeper analysis than a single /performance-analyzer pass
- For systematic optimization of a module or subsystem

Invocation

/perf-pipeline [<target>]

- No argument: analyze recently changed files in the working tree
- File path or glob: analyze specific files or modules
- `--bench <name>`: start from Criterion benchmark results
- `--crate <name>`: analyze an entire crate's hot paths
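
The four argument forms can be modeled as a small enum. This is only a sketch of how a caller might interpret the arguments: the `Target` type and `parse_target` function are hypothetical names, not part of the skill.

```rust
/// What `/perf-pipeline` should analyze. The type and variant names are
/// hypothetical; the skill only documents the four argument forms.
#[derive(Debug, PartialEq)]
pub enum Target {
    /// No argument: recently changed files in the working tree.
    RecentChanges,
    /// A file path or glob.
    Paths(String),
    /// `--bench <name>`: start from Criterion results for that benchmark.
    Bench(String),
    /// `--crate <name>`: an entire crate's hot paths.
    Crate(String),
}

/// Maps the already-split argument list onto a `Target`.
pub fn parse_target(args: &[&str]) -> Result<Target, String> {
    match args {
        [] => Ok(Target::RecentChanges),
        ["--bench", name] => Ok(Target::Bench(name.to_string())),
        ["--crate", name] => Ok(Target::Crate(name.to_string())),
        [flag] if flag.starts_with("--") => {
            Err(format!("flag {flag} is unknown or is missing its name"))
        }
        [path] => Ok(Target::Paths(path.to_string())),
        other => Err(format!("unexpected arguments: {other:?}")),
    }
}
```

For example, `parse_target(&["--bench", "acquire_restore"])` yields `Target::Bench`, while a bare `--crate` is rejected rather than treated as a path.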

Phase 1: Parallel Triage

Launch three diagnostic agents in parallel using the Agent tool. Each agent approaches performance diagnosis from a different angle.

Agent A — Benchmark Triage

Analyze Criterion benchmark data for regressions, outliers, and trends.

Agent prompt template:

You are a Rust performance analyst specializing in benchmark interpretation.
Analyze the target code and any available Criterion benchmark results.

Look for:
- Statistical regressions (>5% median change)
- High variance indicating measurement instability
- Outlier samples suggesting allocator or page-fault stalls, or system noise
- Benchmark gaps (hot code paths with no benchmarks)
- Comparison opportunities (before/after data available)

For each finding, report:
- Category: benchmark-regression | benchmark-gap | measurement-instability
- Location: file:line or benchmark name
- Evidence: numbers, percentages, statistical significance
- Impact estimate: how much latency/throughput is affected
- Recommended next step: which tool or skill to use

Target: {target_description}

Run `cargo bench --bench <relevant> -- --list` to discover available benchmarks.
Run benchmarks if needed to gather data.
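
The ">5% median change" rule in the template can be sketched as a plain comparison of two sample sets. This is an illustration under stated assumptions: the function names are hypothetical, samples are taken to be per-iteration times, and a fixed threshold is not a substitute for the significance testing Criterion performs itself.

```rust
/// Verdicts for a before/after comparison (names are illustrative).
#[derive(Debug, PartialEq)]
pub enum Verdict {
    Regression,
    Improvement,
    WithinNoise,
}

/// Median of a sample set; `None` when the set is empty.
pub fn median(samples: &[f64]) -> Option<f64> {
    if samples.is_empty() {
        return None;
    }
    let mut sorted = samples.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    let mid = sorted.len() / 2;
    Some(if sorted.len() % 2 == 0 {
        (sorted[mid - 1] + sorted[mid]) / 2.0
    } else {
        sorted[mid]
    })
}

/// Classifies the relative change of the median against `threshold`
/// (0.05 corresponds to the 5% rule). Positive change means slower.
pub fn classify(baseline: &[f64], current: &[f64], threshold: f64) -> Option<Verdict> {
    let before = median(baseline)?;
    let after = median(current)?;
    let change = (after - before) / before;
    Some(if change > threshold {
        Verdict::Regression
    } else if change < -threshold {
        Verdict::Improvement
    } else {
        Verdict::WithinNoise
    })
}
```

With a baseline median of 100 ns and a current median of 112 ns, the change is 12% and the result is `Verdict::Regression`; a 3% shift stays `WithinNoise`.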

Agent B — Static Analysis

Analyze code patterns for performance anti-patterns without running anything.

Agent prompt template:

You are a Rust performance analyst specializing in static code analysis for
performance issues. Analyze the target code for anti-patterns.

Check for:

Memory & Allocation:
- Unnecessary allocations in loops (Vec, String, Box)
- Missing with_capacity() for known-size collections
- Cloning where borrowing would suffice
- Large structs passed by value

CPU & Cache:
- False sharing in concurrent data structures
- Cache-unfriendly access patterns (strided, random)
- Branch-heavy code amenable to branchless alternatives
- Missing #[inline] on small hot functions

Async & Concurrency:
- Blocking operations in async contexts
- Lock contention patterns
- Oversized futures
- Unnecessary Arc when ownership would work

Project-specific patterns:
- NONE_U32 = u32::MAX sentinels (avoid Option overhead)
- Allocation tier violations (HOT paths must be allocation-silent)
- ByteSlab/InlineVec/RingBuffer usage opportunities

For each finding, report:
- Category: allocation-hotspot | cache-hostile | lock-contention |
  async-blocking | codegen-issue | vectorization-opportunity
- Location: file:line
- Evidence: the specific code pattern
- Severity: Critical (measurable impact) | High (likely impact) |
  Medium (potential impact) | Low (minor)
- Recommended fix: actionable change with code sketch

Target: {target_description}
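
To make the "NONE_U32 = u32::MAX sentinels" item concrete, the sketch below compares the in-memory size of `Option<u32>` with a sentinel-based wrapper. The `Slot` type is hypothetical and only illustrates the pattern; the project's real types are not shown in this document.

```rust
use std::mem::size_of;

/// Sentinel meaning "no value", as named in the project-specific checklist.
pub const NONE_U32: u32 = u32::MAX;

/// Hypothetical index slot that stores the sentinel instead of `Option<u32>`.
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct Slot(u32);

impl Slot {
    pub const EMPTY: Slot = Slot(NONE_U32);

    /// Rejects the sentinel value itself so it can never be stored as data.
    pub fn new(value: u32) -> Option<Slot> {
        if value == NONE_U32 {
            None
        } else {
            Some(Slot(value))
        }
    }

    /// Converts back to `Option<u32>` at the API boundary.
    pub fn get(self) -> Option<u32> {
        if self.0 == NONE_U32 {
            None
        } else {
            Some(self.0)
        }
    }
}

/// Returns (size of `Option<u32>`, size of `Slot`) in bytes.
pub fn sizes() -> (usize, usize) {
    (size_of::<Option<u32>>(), size_of::<Slot>())
}
```

`u32` has no spare bit pattern for the `None` case, so `Option<u32>` needs a separate discriminant and occupies 8 bytes, while `Slot` stays at 4. In dense arrays that halves the memory touched per element, at the cost of giving up one representable value.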

Agent C — Hotspot Detection

Pre-profiling heuristic scan for likely performance bottlenecks.

Agent prompt template:

You are a Rust performance analyst specializing in hotspot detection. Scan
the target code to find functions and code paths most likely to be performance
bottlenecks, without running profilers.

Heuristics:
- Loop nesting depth and iteration counts
- Allocation density (allocs per iteration)
- Call graph depth in hot paths
- Data structure choice vs access pattern mismatch
- Serialization/deserialization in request paths
- Redundant computation (same value computed multiple times)
- Missed opportunities for short-circuit evaluation

For each hotspot, report:
- Risk level: High | Medium | Low
- Location: file:line (function name)
- Why it's likely hot: evidence from code structure
- Impact estimate: order-of-magnitude guess
- Ease of fix: Easy | Medium | Hard
- Recommended Phase 2 skill:
  * /heap-profile — for allocation attribution
  * /simd-optimize — for vectorizable loops
  * /asm-forge — for codegen quality issues
  * /bench-compare — for before/after measurement
  * /perf-topdown — for CPU microarchitecture bottlenecks
  * /pgo-bolt — for binary layout optimization
  * /causal-profile — for critical-path ambiguity
  * /linux-perf-profile — for PMU counter evidence

Target: {target_description}
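
As an illustration of the "allocation density" heuristic, the two functions below compute the same result, but the first may allocate a fresh `Vec` on every iteration while the second reuses one scratch buffer. The functions and their workload are made up for this example.

```rust
/// Collects into a new Vec for every record: up to one heap allocation
/// per loop iteration.
pub fn sum_even_naive(records: &[Vec<u32>]) -> u64 {
    let mut total = 0u64;
    for rec in records {
        let evens: Vec<u32> = rec.iter().copied().filter(|v| v % 2 == 0).collect();
        total += evens.iter().map(|&v| u64::from(v)).sum::<u64>();
    }
    total
}

/// Reuses one scratch buffer. `clear()` keeps the capacity, so the buffer
/// only reallocates when a record needs more room than any earlier one.
pub fn sum_even_reuse(records: &[Vec<u32>]) -> u64 {
    let mut total = 0u64;
    let mut scratch: Vec<u32> = Vec::new();
    for rec in records {
        scratch.clear();
        scratch.extend(rec.iter().copied().filter(|v| v % 2 == 0));
        total += scratch.iter().map(|&v| u64::from(v)).sum::<u64>();
    }
    total
}
```

A pattern like the first function inside a loop is the kind of structure Agent C would flag and route to /heap-profile for attribution before anyone rewrites it.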

Synthesis & Classification

After all three agents complete, merge and classify findings:

- Match by location: group findings that reference the same function or file:line
- Score convergence: findings reported by more than one agent get elevated priority
- Classify each finding into one of these categories:

| Category | Phase 2 Skill | Description |
|----------|---------------|-------------|
| allocation-hotspot | /heap-profile | Excessive heap allocations in hot path |
| vectorization-opportunity | /simd-optimize | Loop pattern amenable to SIMD |
| codegen-issue | /asm-forge | Missed optimization visible in assembly |
| benchmark-regression | /bench-compare | Needs before/after measurement |
| microarch-bottleneck | /perf-topdown | Cache misses, branch misprediction |
| pgo-candidate | /pgo-bolt | Binary layout optimization opportunity |
| critical-path-unclear | /causal-profile | Hot function may not be on critical path |
| needs-pmu-data | /linux-perf-profile | Need hardware counter evidence |
| general-optimization | /asm-forge | Default: assembly-guided optimization |

- Tag Phase 2 type: mark each finding as one of:
  * Diagnostic (read-only): heap-profile, perf-topdown, linux-perf-profile, causal-profile
  * Optimization (read-write): asm-forge, simd-optimize, pgo-bolt
  * Measurement (read-only): bench-compare
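
The merge and classification steps can be sketched as follows. The category-to-skill mapping and the read-only or read-write tags come from the text above; the `Finding` struct and the scoring rule (count of distinct agents per location) are assumptions made for the example.

```rust
use std::collections::BTreeMap;

/// One triage finding. Field names are hypothetical.
#[derive(Debug, Clone, PartialEq)]
pub struct Finding {
    pub agent: char,      // 'A', 'B', or 'C'
    pub location: String, // "file:line" or a benchmark name
    pub category: String,
}

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Phase2Type {
    Diagnostic,
    Optimization,
    Measurement,
}

/// Category -> (Phase 2 skill, type), following the classification table.
pub fn skill_for(category: &str) -> (&'static str, Phase2Type) {
    match category {
        "allocation-hotspot" => ("/heap-profile", Phase2Type::Diagnostic),
        "vectorization-opportunity" => ("/simd-optimize", Phase2Type::Optimization),
        "codegen-issue" => ("/asm-forge", Phase2Type::Optimization),
        "benchmark-regression" => ("/bench-compare", Phase2Type::Measurement),
        "microarch-bottleneck" => ("/perf-topdown", Phase2Type::Diagnostic),
        "pgo-candidate" => ("/pgo-bolt", Phase2Type::Optimization),
        "critical-path-unclear" => ("/causal-profile", Phase2Type::Diagnostic),
        "needs-pmu-data" => ("/linux-perf-profile", Phase2Type::Diagnostic),
        // general-optimization and anything unrecognized use the default skill.
        _ => ("/asm-forge", Phase2Type::Optimization),
    }
}

/// Groups findings by location and counts the distinct agents that reported
/// each one. Locations with more agents sort first; ties sort by location.
pub fn convergence(findings: &[Finding]) -> Vec<(String, usize)> {
    let mut by_location: BTreeMap<&str, Vec<char>> = BTreeMap::new();
    for finding in findings {
        let agents = by_location.entry(finding.location.as_str()).or_default();
        if !agents.contains(&finding.agent) {
            agents.push(finding.agent);
        }
    }
    let mut ranked: Vec<(String, usize)> = by_location
        .into_iter()
        .map(|(location, agents)| (location.to_string(), agents.len()))
        .collect();
    ranked.sort_by(|x, y| y.1.cmp(&x.1).then_with(|| x.0.cmp(&y.0)));
    ranked
}
```

Matching on the exact location string is the simplest possible grouping; matching by enclosing function, as the first step also allows, would need source-level information that this sketch does not model.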

Human Gate

Present findings to the user:

## Perf Pipeline — Phase 1 Complete

Found {N} performance findings across {M} files.

### Findings (ranked by impact + convergence)

| # | Risk   | Location               | Issue                       | Category                  | Phase 2 Skill  | Type         |
|---|--------|------------------------|-----------------------------|---------------------------|----------------|--------------|
| 1 | High   | src/engine/core.rs:42  | Vec alloc in per-claim loop | allocation-hotspot        | /heap-profile  | Diagnostic   |
| 2 | High   | src/shard/split.rs:88  | Branchless opportunity      | codegen-issue             | /asm-forge     | Optimization |
| 3 | Medium | src/stdx/inline_vec.rs | Loop amenable to NEON SIMD  | vectorization-opportunity | /simd-optimize | Optimization |
| 4 | Medium | bench: acquire_restore | 12% regression vs baseline  | benchmark-regression      | /bench-compare | Measurement  |

Approve all? Enter numbers to select, or modify skill assignments:

The user can:

- Approve all: "all"
- Select specific findings: "1,2,3"
- Override skill assignment: "3 -> /asm-forge" (change recommended skill)
- Skip: "none"
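
The four reply forms can be parsed with a few string operations. This is a sketch only: `GateReply` and `parse_reply` are hypothetical names, and treating replies as case-insensitive is an assumption.

```rust
/// A parsed reply to the human gate (type and variant names are illustrative).
#[derive(Debug, PartialEq)]
pub enum GateReply {
    All,
    Skip,
    Select(Vec<usize>),
    Override { finding: usize, skill: String },
}

/// Parses "all", "none", "1,2,3", or "3 -> /asm-forge".
pub fn parse_reply(input: &str) -> Result<GateReply, String> {
    let reply = input.trim();
    if reply.eq_ignore_ascii_case("all") {
        return Ok(GateReply::All);
    }
    if reply.eq_ignore_ascii_case("none") {
        return Ok(GateReply::Skip);
    }
    // Override form: "<finding number> -> /<skill>".
    if let Some((number, skill)) = reply.split_once("->") {
        let finding = number
            .trim()
            .parse::<usize>()
            .map_err(|e| format!("bad finding number: {e}"))?;
        let skill = skill.trim();
        if !skill.starts_with('/') {
            return Err(format!("skill must start with '/': {skill}"));
        }
        return Ok(GateReply::Override { finding, skill: skill.to_string() });
    }
    // Selection form: comma-separated finding numbers.
    let picks = reply
        .split(',')
        .map(|part| {
            part.trim()
                .parse::<usize>()
                .map_err(|e| format!("bad finding number {part:?}: {e}"))
        })
        .collect::<Result<Vec<_>, _>>()?;
    Ok(GateReply::Select(picks))
}
```

Anything that does not match one of the four forms returns an error, so the pipeline can ask again instead of guessing what the user approved.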

Phase 2: Targeted Execution

Dispatch Order

- Diagnostic findings first (read-only): these produce deeper data that may inform optimization decisions
- Present diagnostic results: brief summary of what profiling found
- Optimization findings (read-write): these modify code, dispatched with file ownership boundaries
- Measurement findings (read-only): run after optimizations to validate
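
One way to express this ordering is a stable sort on a per-type stage number. The sketch below assumes approved findings arrive as `(finding number, type)` pairs, which is a made-up representation.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Phase2Type {
    Diagnostic,
    Optimization,
    Measurement,
}

/// Lower stage runs earlier: diagnose, then optimize, then measure.
fn stage(kind: Phase2Type) -> u8 {
    match kind {
        Phase2Type::Diagnostic => 0,
        Phase2Type::Optimization => 1,
        Phase2Type::Measurement => 2,
    }
}

/// Returns finding numbers in dispatch order. `sort_by_key` is stable, so
/// findings of the same type keep the relative order they arrived in.
pub fn dispatch_order(approved: &[(usize, Phase2Type)]) -> Vec<usize> {
    let mut ordered = approved.to_vec();
    ordered.sort_by_key(|&(_, kind)| stage(kind));
    ordered.into_iter().map(|(id, _)| id).collect()
}
```

The step of presenting diagnostic results to the user sits between stages 0 and 1 and is not modeled here.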

Agent Dispatch

For each approved finding, launch an Agent whose prompt embeds the relevant skill's methodology:

You are a performance specialist applying {skill_name} methodology.

Finding to address:
- Category: {category}
- Location: {file:line}
- Issue: {description}
- Evidence from triage: {evidence}

{Skill-specific methodology and checklist inlined here}

Files you own (only modify these): {file list}

After any code changes, run:
  cargo fmt --all && cargo check && cargo clippy --all-targets --all-features -- -D warnings
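
Filling the template is plain string substitution. The sketch below assumes a hypothetical `Dispatch` struct whose fields mirror the template's placeholders; the methodology text is passed in as an opaque string.

```rust
/// Values substituted into the dispatch template. The struct is hypothetical;
/// its fields follow the placeholders in the template.
pub struct Dispatch<'a> {
    pub skill_name: &'a str,
    pub category: &'a str,
    pub location: &'a str,
    pub issue: &'a str,
    pub evidence: &'a str,
    pub methodology: &'a str,
    pub owned_files: &'a [&'a str],
}

/// Renders the Phase 2 agent prompt for one approved finding.
pub fn render_prompt(d: &Dispatch<'_>) -> String {
    format!(
        "You are a performance specialist applying {} methodology.\n\n\
         Finding to address:\n\
         - Category: {}\n\
         - Location: {}\n\
         - Issue: {}\n\
         - Evidence from triage: {}\n\n\
         {}\n\n\
         Files you own (only modify these): {}\n\n\
         After any code changes, run:\n  \
         cargo fmt --all && cargo check && cargo clippy --all-targets --all-features -- -D warnings\n",
        d.skill_name,
        d.category,
        d.location,
        d.issue,
        d.evidence,
        d.methodology,
        d.owned_files.join(", "),
    )
}
```

Keeping the owned-file list explicit in the prompt is what lets optimization agents run side by side without editing each other's files.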

Parallel vs Sequential

- Diagnostic agents (read-only) → all in parallel
- Optimization agents on non-overlapping files → in parallel
- Optimization agents on overlapping files → sequential
- Measurement agents → after optimizations complete
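
The overlap rule for optimization agents can be sketched as a greedy batching into "waves": tasks in the same wave touch disjoint files and may run in parallel, and waves run one after another. `OptTask` and `plan_waves` are hypothetical names, and greedy first-fit is just one reasonable strategy.

```rust
use std::collections::HashSet;

/// An optimization task and the files its agent owns (hypothetical type).
#[derive(Debug, Clone, PartialEq)]
pub struct OptTask {
    pub id: usize,
    pub files: Vec<String>,
}

/// Greedy first-fit batching: each task joins the first wave that does not
/// already own any of its files, otherwise it opens a new wave.
pub fn plan_waves(tasks: &[OptTask]) -> Vec<Vec<usize>> {
    let mut waves: Vec<(HashSet<&str>, Vec<usize>)> = Vec::new();
    for task in tasks {
        let slot = waves
            .iter_mut()
            .find(|(owned, _)| task.files.iter().all(|f| !owned.contains(f.as_str())));
        match slot {
            Some((owned, ids)) => {
                owned.extend(task.files.iter().map(String::as_str));
                ids.push(task.id);
            }
            None => {
                let owned = task.files.iter().map(String::as_str).collect();
                waves.push((owned, vec![task.id]));
            }
        }
    }
    waves.into_iter().map(|(_, ids)| ids).collect()
}
```

Given tasks owning `a.rs`, `b.rs`, `a.rs` plus `c.rs`, and `c.rs`, the first, second, and fourth tasks share a wave and the third runs afterwards on its own.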

Feedback Loop

If diagnostic Phase 2 agents (heap-profile, perf-topdown, etc.) produce findings that change the optimization picture, present an intermediate gate:

## Perf Pipeline — Diagnostic Phase 2 Complete

/heap-profile found: Top allocator is `ShardMap::resize` at 4.2 MB/s
/perf-topdown found: 38% of cycles are backend-bound (L3 cache misses)

Updated recommendations:
| # | Location | Original Skill | Updated Skill | Reason |
|---|----------|---------------|---------------|--------|
| 2 | split.rs:88 | /asm-forge | /simd-optimize | Cache-line alignment more impactful |

Proceed with updated plan? Or modify:

Completion

## Perf Pipeline — Complete

### Results

| Finding | Phase 2 Skill  | Status    | Result                            |
|---------|----------------|-----------|-----------------------------------|
| #1      | /heap-profile  | Diagnosed | ShardMap::resize is top allocator |
| #2      | /asm-forge     | Optimized | Eliminated branch in split loop   |
| #3      | /simd-optimize | Optimized | NEON vectorized InlineVec scan    |
| #4      | /bench-compare | Measured  | 8% improvement vs baseline        |

### Verification

Run to confirm:
  cargo fmt --all && cargo check && cargo clippy --all-targets --all-features -- -D warnings
  cargo bench --bench <relevant>

Error Handling

- If a Phase 1 agent fails, proceed with the other agents' findings
- If a Phase 2 optimization makes cargo check fail, revert and report
- If bench-compare shows regression after optimization, flag for user review
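
These three rules reduce to two small decisions, sketched below with hypothetical names. Treating any positive change from bench-compare as a regression is an assumption; the text does not give a threshold.

```rust
/// What to do after one Phase 2 optimization (names are illustrative).
#[derive(Debug, PartialEq)]
pub enum Action {
    Accept,
    RevertAndReport,
    FlagForReview,
}

/// `check_passed`: whether `cargo check` succeeded after the change.
/// `bench_change`: relative change from bench-compare, if it was measured
/// (positive means slower).
pub fn after_optimization(check_passed: bool, bench_change: Option<f64>) -> Action {
    if !check_passed {
        return Action::RevertAndReport;
    }
    match bench_change {
        Some(delta) if delta > 0.0 => Action::FlagForReview,
        _ => Action::Accept,
    }
}

/// Phase 1: keep findings from the agents that succeeded and collect the
/// error messages of those that failed, so the pipeline can still proceed.
pub fn surviving_findings(
    results: Vec<Result<Vec<String>, String>>,
) -> (Vec<String>, Vec<String>) {
    let mut findings = Vec::new();
    let mut failures = Vec::new();
    for result in results {
        match result {
            Ok(mut found) => findings.append(&mut found),
            Err(message) => failures.push(message),
        }
    }
    (findings, failures)
}
```

How the revert itself is performed (for example through version control) is left open here, as it is in the text.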

Related Skills

- /rust-perf-triage: Phase 1 methodology (benchmark data)
- /performance-analyzer: Phase 1 methodology (static analysis)
- /rust-hotspot-finder: Phase 1 methodology (hotspot heuristics)
- /heap-profile /simd-optimize /asm-forge /bench-compare /perf-topdown /pgo-bolt /causal-profile /linux-perf-profile: Phase 2 specialists
- /review-pipeline: Code quality team pipeline
- /test-pipeline: Testing team pipeline

ahrav/perf-pipeline