skills/funprm-function-as-step-process-reward/SKILL.md
Generate high-quality code by decomposing solutions into modular functions (Chain-of-Function style), then self-evaluating each function as a discrete reasoning step to select the best candidate. Triggers: 'solve this coding problem', 'generate modular code', 'write clean functions for this algorithm', 'best-of-N code generation', 'help me with this competitive programming problem', 'generate multiple solution candidates'
npx skillsauth add ndpvt-web/arxiv-claude-skills funprm-function-as-step-process-rewardInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
This skill teaches Claude to apply the FunPRM (Function-as-Step Process Reward Model) technique from research by Zhang et al. (2026). Instead of writing monolithic code blocks, Claude decomposes every solution into well-named helper functions, treats each function as a scoreable reasoning step, generates multiple candidate solutions, and selects the best one by evaluating per-function correctness. This produces code that is more likely correct, more readable, and more reusable -- particularly on hard algorithmic and competitive programming problems.
Why functions as steps? Traditional process reward models for math treat each line of reasoning as a step. This fails for code because individual lines are not semantically meaningful units. FunPRM's insight is that functions are the natural reasoning steps in code: each one encapsulates a distinct sub-problem, has a clear interface (inputs/outputs), and can be evaluated for correctness independently. A solution decomposed into 3-6 well-named functions is far easier to score than a 50-line monolithic block.
The Chain-of-Function prompting pattern instructs the LLM to: (1) place the high-level solving logic in a main entry-point function, (2) extract repeated or complex logic into dedicated helper functions at the top level (no nesting), and (3) add docstrings to every function explaining its purpose and algorithmic approach. This mirrors how experienced developers naturally structure code -- and it gives the self-evaluation pass clear units to reason about.
Self-evaluation via per-function scoring: After generating N candidate solutions (typically 3-8), evaluate each by mentally tracing through every function: Does it handle edge cases? Is its logic consistent with its docstring? Does the main function compose helpers correctly? Aggregate per-function confidence into a solution-level score and select the highest-scoring candidate. This is the core Best-of-N selection mechanism, adapted for a single-model workflow without a separate trained reward model.
Analyze the problem requirements. Read the problem statement carefully. Identify inputs, outputs, constraints, and edge cases. Note the expected time/space complexity if specified.
Decompose into sub-problems. Before writing any code, list 2-5 distinct sub-tasks the solution requires. For example, a graph problem might decompose into: build adjacency list, run BFS/DFS, extract result from traversal state. Each sub-task becomes a function.
Design the function hierarchy. Define the main entry-point function signature first. Then define helper function signatures with clear parameter names and return types. Order them: main function first, helpers below. No nested function definitions -- all functions at the top level.
Write each function with a docstring. Every function gets a 1-2 line docstring explaining: (a) what it computes, (b) the algorithmic approach it uses. The main function's docstring describes the overall strategy. Helper docstrings describe the specific sub-logic.
Generate N candidate solutions (N=3-5). Produce multiple distinct implementations. Vary the algorithmic approach where possible (e.g., iterative vs recursive, different data structures). If variation is limited, vary the decomposition boundaries or edge-case handling. Each candidate must follow the Chain-of-Function structure.
Score each candidate per-function. For each candidate, evaluate every function independently:
Aggregate and select. The solution-level score is the average of per-function scores. Select the candidate with the highest aggregate score. If two candidates tie, prefer the one with fewer functions (simpler decomposition) or better edge-case coverage.
Verify with mental dry-run. Trace the selected solution through 2-3 test cases including at least one edge case. If it fails, either fix it or fall back to the next-best candidate.
Present the solution. Output the selected code with its function structure intact. Include the docstrings. If the user asked for a single solution, show only the winner. If they asked for candidates, show all with your scoring rationale.
Example 1: Binary search on answer
User: "Find the minimum capacity needed to ship packages within D days, given a list of package weights."
Approach:
def ship_within_days(weights: list[int], days: int) -> int:
"""Find minimum ship capacity to deliver all packages within given days.
Approach: Binary search on capacity between max(weights) and sum(weights)."""
lo, hi = max(weights), sum(weights)
while lo < hi:
mid = (lo + hi) // 2
if can_ship_in_time(weights, mid, days):
hi = mid
else:
lo = mid + 1
return lo
def can_ship_in_time(weights: list[int], capacity: int, days: int) -> bool:
"""Check whether all packages can be shipped within the day limit at given capacity.
Approach: Greedily fill each day up to capacity, count days used."""
days_needed = 1
current_load = 0
for w in weights:
if current_load + w > capacity:
days_needed += 1
current_load = 0
current_load += w
return days_needed <= days
Per-function evaluation:
ship_within_days: Binary search bounds correct (lo=max single package, hi=all in one day). Loop termination correct (lo < hi with hi = mid). High confidence.can_ship_in_time: Greedy simulation correctly resets load on overflow, starts at days_needed=1. Handles single-element case. High confidence.Example 2: Complex string parsing with multiple candidates
User: "Parse a nested mathematical expression like 2*(3+(4*5)) and evaluate it."
Approach:
Candidate A -- Recursive descent with explicit tokenizer:
def evaluate_expression(expr: str) -> int:
"""Evaluate a nested math expression string.
Approach: Tokenize, then recursive descent parse and evaluate."""
tokens = tokenize(expr)
result, _ = parse_expr(tokens, 0)
return result
def tokenize(expr: str) -> list[str]:
"""Split expression into number and operator tokens.
Approach: Scan characters, accumulate digits into number tokens."""
tokens = []
i = 0
while i < len(expr):
if expr[i].isdigit():
j = i
while j < len(expr) and expr[j].isdigit():
j += 1
tokens.append(expr[i:j])
i = j
elif expr[i] in '+-*()':
tokens.append(expr[i])
i += 1
else:
i += 1 # skip whitespace
return tokens
def parse_expr(tokens: list[str], pos: int) -> tuple[int, int]:
"""Parse additive expression (handles + and -).
Approach: Parse left term, then consume +/- operators greedily."""
left, pos = parse_term(tokens, pos)
while pos < len(tokens) and tokens[pos] in '+-':
op = tokens[pos]
right, pos = parse_term(tokens, pos + 1)
left = left + right if op == '+' else left - right
return left, pos
def parse_term(tokens: list[str], pos: int) -> tuple[int, int]:
"""Parse multiplicative expression (handles *).
Approach: Parse left factor, then consume * operators greedily."""
left, pos = parse_factor(tokens, pos)
while pos < len(tokens) and tokens[pos] == '*':
right, pos = parse_factor(tokens, pos + 1)
left *= right
return left, pos
def parse_factor(tokens: list[str], pos: int) -> tuple[int, int]:
"""Parse a number or parenthesized sub-expression.
Approach: If '(' consume until matching ')', else parse number."""
if tokens[pos] == '(':
val, pos = parse_expr(tokens, pos + 1)
pos += 1 # skip ')'
return val, pos
return int(tokens[pos]), pos + 1
Candidate B -- Stack-based evaluation (single pass, no AST):
def evaluate_expression(expr: str) -> int:
"""Evaluate nested math expression using operator-precedence stack.
Approach: Two stacks (values, operators), process by precedence."""
# ... (implementation with precedence handling)
Per-function evaluation of Candidate A:
tokenize: Handles multi-digit numbers, skips whitespace, captures all operators. High.parse_expr: Correctly delegates to parse_term for precedence. Left-associative. High.parse_term: Mirrors parse_expr for multiplication. High.parse_factor: Handles parens and numbers. Correctly advances past closing paren. High.evaluate_expression: Thin wrapper, correct. High.Candidate B stack-based: More error-prone with precedence edge cases. Medium.
Selected: Candidate A (higher aggregate, clearer structure).
Example 3: Graph problem with edge-case focus
User: "Find if a path exists between two nodes in an undirected graph given as an edge list."
def valid_path(n: int, edges: list[list[int]], source: int, destination: int) -> bool:
"""Determine if a path exists between source and destination in an undirected graph.
Approach: Build adjacency list, then BFS from source."""
if source == destination:
return True
graph = build_adjacency_list(n, edges)
return bfs_path_exists(graph, source, destination)
def build_adjacency_list(n: int, edges: list[list[int]]) -> dict[int, list[int]]:
"""Convert edge list to adjacency list representation.
Approach: Iterate edges, add both directions for undirected graph."""
adj = {i: [] for i in range(n)}
for u, v in edges:
adj[u].append(v)
adj[v].append(u)
return adj
def bfs_path_exists(graph: dict[int, list[int]], source: int, target: int) -> bool:
"""Check reachability from source to target using BFS.
Approach: Standard BFS with visited set, return True if target found."""
from collections import deque
visited = {source}
queue = deque([source])
while queue:
node = queue.popleft()
for neighbor in graph[node]:
if neighbor == target:
return True
if neighbor not in visited:
visited.add(neighbor)
queue.append(neighbor)
return False
Per-function evaluation:
valid_path: Early return for source==destination (edge case). High.build_adjacency_list: Initializes all n nodes, adds both directions. High.bfs_path_exists: Uses visited set (no infinite loops), deque for O(1) pops. High.Do:
Avoid:
Zhang, R., Qin, P., Cao, Q., Xue, E., & Xie, P. (2026). FunPRM: Function-as-Step Process Reward Model with Meta Reward Correction for Code Generation. arXiv:2601.22249v1. https://arxiv.org/abs/2601.22249v1
Key insight: Functions are the natural unit of reasoning in code. Decomposing solutions into modular functions and evaluating each function independently produces better selection among candidates than evaluating whole solutions at once. The meta-learning reward correction mechanism further improves accuracy by using clean final-solution signals to denoise per-step estimates.
development
Audit LLM-based automatic short answer grading (ASAG) systems for adversarial vulnerabilities using token-level and prompt-level attack strategies from the GradingAttack framework. Triggers: 'test grading robustness', 'adversarial attack on grading', 'audit LLM grader', 'red-team answer grading', 'ASAG vulnerability assessment', 'grading fairness attack'
development
Build structured information-seeking agents that decompose complex queries into multi-turn search-and-browse workflows, aggregate results from multiple web sources, and return answers in typed structured formats (items, sets, lists, tables). Applies the GISA benchmark's ReAct-based agent architecture and evaluation methodology. Trigger phrases: "build an information-seeking agent", "search agent pipeline", "multi-turn web research agent", "structured web search workflow", "aggregate information from multiple sources", "web research with structured output"
data-ai
Optimize LLM prompts using GFlowPO's iterative generate-evaluate-refine loop with diversity-preserving exploration and dynamic memory. Use when: 'optimize this prompt', 'find a better prompt for this task', 'prompt engineering with examples', 'auto-tune my system prompt', 'improve prompt accuracy', 'generate prompt variations'.
development
Constrain LLM generation with executable Pydantic schemas and multi-agent pipelines to produce structurally valid, domain-rich artifacts. Uses ontology-as-grammar to eliminate hallucinated structures while preserving creative output. Trigger phrases: "generate a valid game design", "schema-constrained generation", "build a multi-agent pipeline with Pydantic validation", "ontology-driven content generation", "structured creative generation with DSPy", "generate artifacts that pass domain validation".