skills/from-pragmas-partners-symbiotic/SKILL.md
Agentic High-Level Synthesis (HLS) optimization: autonomously analyze, insert, and tune C/C++ HLS pragmas (pipeline, unroll, array_partition, dataflow) through closed-loop feedback with synthesis tools. Use when: 'optimize this HLS kernel', 'add pragmas to this C++ for FPGA', 'explore the design space for this Vitis HLS project', 'tune my hardware accelerator', 'analyze this HLS synthesis report', 'debug why my HLS design has poor throughput'.
npx skillsauth add ndpvt-web/arxiv-claude-skills from-pragmas-partners-symbiotic

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
This skill enables Claude to act as an agentic HLS optimization partner that analyzes C/C++ hardware kernels, systematically inserts and tunes synthesis pragmas, interprets HLS tool reports, and iteratively refines designs through a closed-loop feedback workflow. Based on the six-level autonomy taxonomy from Zhang et al. (2026), this skill operates at L1-L3: explaining synthesis reports with source attribution, suggesting pragma edits, running closed-loop autotuning over defined search spaces, and performing multi-step tool-guided optimization with verification.
Pragmas in scope: `#pragma HLS pipeline`, `unroll`, `array_partition`, `dataflow`, `inline`, etc.

HLS as an agentic abstraction layer. High-Level Synthesis compiles C/C++ into RTL (Verilog/VHDL) for FPGAs. The key optimization lever is pragmas: compiler directives that control parallelism, pipelining, memory layout, and dataflow. The design space is combinatorial: a kernel with 5 loops and 3 arrays can have thousands of valid pragma configurations. Manual exploration is slow; agents excel here because HLS code is highly permutable, meaning one pragma configuration can be swapped for another without changing functional semantics.
Mixed-fidelity feedback loop. Rather than running full synthesis (which takes minutes to hours) for every candidate, agents should use a three-tier evaluation strategy: (1) low-fidelity — static analysis of pragma compatibility and resource estimates, (2) medium-fidelity — C-simulation and lightweight scheduling analysis from HLS tool reports, and (3) high-fidelity — full synthesis and place-and-route for the most promising candidates. This reduces iteration cost dramatically.
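The three-tier strategy can be pictured as a funnel. The sketch below is a minimal illustration, not a tool integration: `Candidate`, `Evaluators`, and `funnel` are hypothetical names, and the three callbacks are stand-ins for a static check, a report-based estimate, and a full synthesis run.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical description of one pragma configuration.
struct Candidate {
    int unroll_factor;
    int partition_factor;
};

// Caller-supplied stand-ins for the three fidelity tiers.
// None of them invokes a real HLS tool in this sketch.
struct Evaluators {
    std::function<bool(const Candidate&)> static_ok;            // tier 1: cheap filter
    std::function<long(const Candidate&)> estimated_latency;    // tier 2: quick estimate
    std::function<long(const Candidate&)> synthesized_latency;  // tier 3: expensive run
};

// Returns the candidate with the lowest tier-3 latency, spending tier-3
// evaluations only on the `keep` best candidates ranked by tier 2.
// `expensive_runs` reports how many tier-3 evaluations were performed.
inline Candidate funnel(std::vector<Candidate> pool, const Evaluators& ev,
                        std::size_t keep, std::size_t& expensive_runs) {
    // Tier 1: drop configurations that fail the static check.
    pool.erase(std::remove_if(pool.begin(), pool.end(),
                              [&](const Candidate& c) { return !ev.static_ok(c); }),
               pool.end());
    assert(!pool.empty() && keep > 0);

    // Tier 2: rank the survivors by estimated latency and keep the best few.
    std::stable_sort(pool.begin(), pool.end(),
                     [&](const Candidate& a, const Candidate& b) {
                         return ev.estimated_latency(a) < ev.estimated_latency(b);
                     });
    if (pool.size() > keep) pool.resize(keep);

    // Tier 3: run the expensive evaluation on the shortlist only.
    expensive_runs = 0;
    Candidate best = pool.front();
    long best_latency = -1;
    for (const Candidate& c : pool) {
        const long latency = ev.synthesized_latency(c);
        ++expensive_runs;
        if (best_latency < 0 || latency < best_latency) {
            best_latency = latency;
            best = c;
        }
    }
    return best;
}
```

The point of the structure is that the number of expensive runs is bounded by `keep`, regardless of how many candidates enter the pool.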
Golden reference validation. Every optimization must be verified against a functional golden reference — the original un-optimized or known-correct design. Agents run C-simulation or co-simulation to confirm that pragma insertions preserve correctness before evaluating performance. This prevents the common failure mode where aggressive pragmas produce designs that synthesize but compute incorrect results.
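A golden-reference check can be as small as running both versions on the same inputs and comparing element by element. The following is a sketch under that assumption; `Kernel`, `first_mismatch`, and `matches_golden` are illustrative names, and a real testbench would use the kernel's actual signature.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Illustrative kernel signature: reads n inputs, writes n outputs.
using Kernel = void (*)(const int* in, int* out, std::size_t n);

// Index of the first differing element, or -1 if the outputs are identical.
inline long first_mismatch(const std::vector<int>& golden,
                           const std::vector<int>& candidate) {
    assert(golden.size() == candidate.size());
    for (std::size_t i = 0; i < golden.size(); ++i) {
        if (golden[i] != candidate[i]) return static_cast<long>(i);
    }
    return -1;
}

// Runs the known-correct kernel and the optimized kernel on the same input
// and reports whether the optimized version reproduces the golden output.
inline bool matches_golden(Kernel golden, Kernel candidate,
                           const std::vector<int>& input) {
    std::vector<int> expected(input.size());
    std::vector<int> actual(input.size());
    golden(input.data(), expected.data(), input.size());
    candidate(input.data(), actual.data(), input.size());
    return first_mismatch(expected, actual) < 0;
}
```

Returning the mismatch index rather than a bare boolean makes it easier to map a failure back to the loop iteration that produced it.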
Parse the HLS kernel. Read the C/C++ source and identify all synthesizable functions, loops (with trip counts), arrays (with dimensions and access patterns), and existing pragmas. Build a mental model of the compute and memory structure.
Establish the golden reference. If a testbench exists, run C-simulation (vitis_hls -f run_csim.tcl or equivalent) to capture expected outputs. If no testbench exists, create one with representative inputs and record outputs. This is the correctness baseline.
Profile the current design. Run HLS synthesis on the unoptimized design and extract key metrics from the report: total latency (cycles), initiation interval (II) per loop, resource utilization (LUT, FF, BRAM, DSP), and any warnings about unresolved dependencies or failed scheduling.
Identify bottlenecks. Map report metrics back to source code locations. Common bottlenecks: loops with II > 1 (caused by loop-carried dependencies or memory port conflicts), large latency from non-pipelined loops, excessive BRAM usage from non-partitioned arrays, and missing dataflow between sequential functions.
Generate pragma candidates. For each bottleneck, propose targeted pragmas:
- `#pragma HLS pipeline II=1` (and resolve the dependency causing II inflation)
- `#pragma HLS unroll factor=N` (choose N based on resource budget)
- `#pragma HLS array_partition variable=arr type=cyclic factor=N`
- `#pragma HLS dataflow` at the function level
- `#pragma HLS inline`

Evaluate candidates at low fidelity. Before synthesis, check for pragma conflicts: unrolling a loop by N requires N memory ports, so array partition factors must match. Verify that pipeline and dataflow pragmas are not applied to the same scope (they are mutually exclusive within a region). Estimate resource growth from unroll factors.
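The low-fidelity checks described above are simple enough to express as plain functions. This sketch assumes one array access per unrolled loop copy and accesses spread evenly across banks; `LoopConfig`, `Conflict`, `check`, and `estimated_dsps` are hypothetical names, and the linear DSP estimate is a first-order guess rather than a tool result.

```cpp
#include <cassert>

// Hypothetical summary of the pragmas applied around one loop.
struct LoopConfig {
    int unroll_factor;       // 1 means the loop is not unrolled
    int partition_factor;    // cyclic partition factor of the accessed array; 1 means none
    int ports_per_bank;      // ports per memory bank; set to 1 for the most conservative check
    bool pipeline_in_scope;  // a pipeline pragma is applied to this scope
    bool dataflow_in_scope;  // a dataflow pragma is applied to this scope
};

enum class Conflict { None, PipelineWithDataflow, TooFewPorts };

// Static compatibility check that needs no synthesis run.
inline Conflict check(const LoopConfig& c) {
    assert(c.unroll_factor >= 1 && c.partition_factor >= 1 && c.ports_per_bank >= 1);
    if (c.pipeline_in_scope && c.dataflow_in_scope) {
        return Conflict::PipelineWithDataflow;
    }
    // Each unrolled copy needs its own access per cycle; partitioning
    // multiplies the number of ports available to serve them.
    if (c.unroll_factor > c.partition_factor * c.ports_per_bank) {
        return Conflict::TooFewPorts;
    }
    return Conflict::None;
}

// Crude estimate: datapath resources grow linearly with the unroll factor.
inline int estimated_dsps(int dsps_per_iteration, int unroll_factor) {
    return dsps_per_iteration * unroll_factor;
}
```

Candidates that return anything other than `Conflict::None`, or whose estimate exceeds the budget, can be discarded before any synthesis time is spent.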
Synthesize promising configurations. Run HLS synthesis on the top 2-4 configurations. Compare latency, II, and resource utilization against the baseline. Record results in a structured format for comparison.
Verify correctness. For any configuration that improves performance, re-run C-simulation against the golden reference. If outputs diverge, discard the configuration and investigate whether the pragma exposed a latent bug or changed semantics.
Iterate or compose. If targets are not met, combine successful pragmas (e.g., pipeline + array partition), try alternative strategies (e.g., loop tiling instead of full unroll), or restructure the code (e.g., split a function to enable dataflow). Each iteration follows the same synthesize-verify cycle.
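As one illustration of tiling instead of a full unroll, the sketch below splits a 64-element dot product into tiles of 4 and unrolls only the inner tile loop. The kernel, the tile size, and the function names are assumptions chosen for the example; a regular C++ compiler ignores the HLS pragmas, so the same file doubles as a C-simulation check against the untiled reference.

```cpp
#include <cassert>

constexpr int N = 64;
constexpr int TILE = 4;  // illustrative tile size; must divide N
static_assert(N % TILE == 0, "tile size must divide the trip count");

// Untiled reference used as the golden model.
inline int dot_reference(const int a[N], const int b[N]) {
    int sum = 0;
    for (int k = 0; k < N; ++k) sum += a[k] * b[k];
    return sum;
}

// Tiled version: the outer loop is pipelined, the inner loop of TILE
// iterations is fully unrolled, and the arrays are partitioned cyclically
// by the same factor so each unrolled copy has its own bank.
inline int dot_tiled(const int a[N], const int b[N]) {
#pragma HLS array_partition variable=a type=cyclic factor=4
#pragma HLS array_partition variable=b type=cyclic factor=4
    int sum = 0;
    for (int t = 0; t < N; t += TILE) {
#pragma HLS pipeline II=1
        int partial = 0;
        for (int k = 0; k < TILE; ++k) {
#pragma HLS unroll
            partial += a[t + k] * b[t + k];
        }
        sum += partial;
    }
    return sum;
}
```

Keeping the partition factor equal to the tile size is the design choice that matters here: it bounds resource growth at 4 multipliers while still giving every unrolled access its own memory bank.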
Report the Pareto frontier. Present the user with a table of configurations showing latency, throughput, and resource utilization. Recommend the configuration that best matches their constraints (e.g., "must fit within 80% of available BRAMs" or "minimize latency regardless of area").
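Selecting the Pareto frontier and then applying the user's budget might look like the sketch below. `Result`, `dominates`, `pareto_front`, and `pick_within_budget` are hypothetical names, and the three tracked metrics (latency, DSP, BRAM) are an assumption; a real report would likely track LUT and FF as well.

```cpp
#include <cassert>
#include <string>
#include <vector>

// One synthesized configuration and its measured metrics.
struct Result {
    std::string name;
    long latency_cycles;
    int dsp;
    int bram;
};

// True if `a` is no worse than `b` on every metric and better on at least one.
inline bool dominates(const Result& a, const Result& b) {
    const bool no_worse = a.latency_cycles <= b.latency_cycles &&
                          a.dsp <= b.dsp && a.bram <= b.bram;
    const bool better = a.latency_cycles < b.latency_cycles ||
                        a.dsp < b.dsp || a.bram < b.bram;
    return no_worse && better;
}

// Keeps only the configurations that no other configuration dominates.
inline std::vector<Result> pareto_front(const std::vector<Result>& all) {
    std::vector<Result> front;
    for (const Result& r : all) {
        bool dominated = false;
        for (const Result& other : all) {
            if (dominates(other, r)) {
                dominated = true;
                break;
            }
        }
        if (!dominated) front.push_back(r);
    }
    return front;
}

// Lowest-latency configuration that fits the budget, or nullptr if none fits.
inline const Result* pick_within_budget(const std::vector<Result>& front,
                                        int max_dsp, int max_bram) {
    const Result* best = nullptr;
    for (const Result& r : front) {
        if (r.dsp > max_dsp || r.bram > max_bram) continue;
        if (best == nullptr || r.latency_cycles < best->latency_cycles) best = &r;
    }
    return best;
}
```

Separating the frontier from the final pick lets the same set of results answer different constraints without re-running synthesis.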
Example 1: Optimizing a matrix multiplication kernel
User: "Optimize this HLS kernel for Vitis HLS — I need better throughput."
```cpp
// matmul.cpp
void matmul(int A[64][64], int B[64][64], int C[64][64]) {
    for (int i = 0; i < 64; i++)
        for (int j = 0; j < 64; j++) {
            int sum = 0;
            for (int k = 0; k < 64; k++)
                sum += A[i][k] * B[k][j];
            C[i][j] = sum;
        }
}
```
Approach:
- The k loop accumulates into a scalar, which makes it a good candidate for pipelining.

```cpp
void matmul(int A[64][64], int B[64][64], int C[64][64]) {
#pragma HLS array_partition variable=A type=complete dim=2
#pragma HLS array_partition variable=B type=complete dim=1
    for (int i = 0; i < 64; i++)
        for (int j = 0; j < 64; j++) {
#pragma HLS pipeline II=1
            int sum = 0;
            for (int k = 0; k < 64; k++)
                sum += A[i][k] * B[k][j];
            C[i][j] = sum;
        }
}
```

- `pipeline` on the j loop fully unrolls the inner k loop. Array partitioning on dim=2 of A and dim=1 of B provides enough ports for the parallel reads.

Output:

| Config | Latency (cycles) | II | BRAM | DSP | LUT |
|------------|-------------------|-----|------|-----|-------|
| Baseline | 262,144 | 1 | 6 | 3 | 1,200 |
| Optimized | 4,160 | 1 | 6 | 64 | 8,400 |
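The latencies in the table are consistent with simple trip-count arithmetic, which the compile-time checks below make explicit. The first-order model `(trip_count - 1) * II + iteration_latency` is the usual approximation for a pipelined loop; the 65-cycle iteration latency is back-solved from the table and is an assumption, not a value reported by a tool.

```cpp
#include <cassert>

constexpr long kN = 64;  // matrix dimension from the example

// Innermost-loop iterations when all three loops run sequentially.
constexpr long baseline_iterations() { return kN * kN * kN; }

// Iterations of the pipelined j loop once the k loop is fully unrolled.
constexpr long optimized_iterations() { return kN * kN; }

// First-order latency model for a pipelined loop.
constexpr long pipelined_latency(long trip_count, long ii, long iteration_latency) {
    return (trip_count - 1) * ii + iteration_latency;
}

// 64 * 64 * 64 iterations at one cycle each matches the baseline row.
static_assert(baseline_iterations() == 262144, "baseline row");

// 4,096 iterations at II=1 plus an assumed 65-cycle iteration latency
// reproduces the optimized row.
static_assert(pipelined_latency(optimized_iterations(), 1, 65) == 4160, "optimized row");
```

Under these assumptions the speedup is roughly 63x, paid for with the jump from 3 to 64 DSPs shown in the table.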
Example 2: Debugging II > 1 in a FIR filter
User: "My FIR filter has II=3 but I set pipeline II=1. Why?"
```cpp
void fir(int input, int output[1], int coeffs[16], int shift_reg[16]) {
#pragma HLS pipeline II=1
    // Shift the older samples first, then insert the new one, so the new
    // sample is not duplicated into shift_reg[1].
    for (int i = 15; i > 0; i--)
        shift_reg[i] = shift_reg[i-1];
    shift_reg[0] = input;
    int acc = 0;
    for (int i = 0; i < 16; i++)
        acc += shift_reg[i] * coeffs[i];
    output[0] = acc;
}
```
Approach:
- shift_reg is accessed in a read-after-write pattern across the shift loop, and it defaults to a dual-port BRAM, which offers only 2 read/write ports per cycle.
- The shift loop executes shift_reg[i] = shift_reg[i-1], which creates a sequential dependency chain, and BRAM port limits cause II inflation.
- Partition shift_reg completely so each element is a register: `#pragma HLS array_partition variable=shift_reg type=complete`
Output: "The II=3 is caused by BRAM port contention on shift_reg. The shift loop requires reading and writing multiple elements per cycle, but a BRAM only has 2 ports. Adding #pragma HLS array_partition variable=shift_reg type=complete converts it to registers, resolving the conflict. This costs sixteen 32-bit registers (512 FFs) instead of 1 BRAM block."
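For reference, a sketch of the filter with the partition pragma applied is shown below. It keeps the signature from the question, uses a hypothetical `TAPS` constant for the length, and shifts the delay line before writing the new sample so that the newest input occupies tap 0 only. A regular C++ compiler ignores the pragmas, so an impulse test can confirm the behavior in C-simulation.

```cpp
#include <cassert>

constexpr int TAPS = 16;  // filter length from the example

void fir(int input, int output[1], int coeffs[TAPS], int shift_reg[TAPS]) {
#pragma HLS pipeline II=1
    // Complete partitioning removes the two-port limit on shift_reg accesses.
#pragma HLS array_partition variable=shift_reg type=complete
    // Shift older samples toward higher indices, oldest first.
    for (int i = TAPS - 1; i > 0; i--)
        shift_reg[i] = shift_reg[i - 1];
    shift_reg[0] = input;

    // Multiply-accumulate across all taps.
    int acc = 0;
    for (int i = 0; i < TAPS; i++)
        acc += shift_reg[i] * coeffs[i];
    output[0] = acc;
}
```

Feeding a single 1 followed by zeros should return the coefficients one per call, which is a quick functional check to run before and after adding the pragma.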
Example 3: Design space exploration for a convolution kernel
User: "Explore the design space for this 2D convolution — I have a budget of 200 DSPs and 100 BRAMs."
Approach:
Output:
| Unroll | Partition | Latency | DSP | BRAM | Feasible? |
|--------|-----------|----------|-----|------|-----------|
| 1 | none | 1,048K | 5 | 12 | Yes |
| 2 | cyclic=2 | 524K | 10 | 16 | Yes |
| 4 | cyclic=4 | 262K | 20 | 24 | Yes |
| 8 | cyclic=8 | 131K | 40 | 40 | Yes |
| 16 | cyclic=16 | 65K | 80 | 72 | Yes |
Recommended: unroll=16 (lowest latency in the table; its 80 DSPs and 72 BRAMs fit within the 200 DSP / 100 BRAM budget).
- Start with `pipeline` on the innermost loop, then work outward. Pipelining is the highest-impact single pragma in most designs.
- Match `array_partition` factors to the parallelism implied by `unroll` or `pipeline`. If you unroll by 4, you need at least 4 memory ports on each accessed array.
- `dataflow` and `pipeline` in the same scope conflict. Nested dataflow regions are not supported by most tools.
- Avoid `unroll` with no factor (defaults to complete unroll) on large loops: this explodes resource usage and synthesis time.
- Check that `#pragma HLS interface` directives match the testbench's port protocol (ap_ctrl_hs vs. ap_ctrl_chain, axis vs. m_axi).
- Use `#pragma HLS loop_tripcount min=N max=M` to provide trip-count estimates, or restructure code to use compile-time constants.

Zhang, N., Kim, S., Srinath, S., & Zhang, Z. (2026). From Pragmas to Partners: A Symbiotic Evolution of Agentic High-Level Synthesis. arXiv:2602.01401v3. https://arxiv.org/abs/2602.01401v3
Key insight: HLS code is highly permutable — pragma configurations can be swapped without changing functional semantics — making it an ideal substrate for agentic optimization through mixed-fidelity feedback loops and automated design space exploration.