skills/evaluating-achieving-controllable-code/SKILL.md
Instruction-guided code completion that follows user constraints on algorithm choice, data structures, control flow, and code scope. Use when: 'complete this function using a deque-based BFS', 'finish this code with exactly 3 lines', 'implement the sort using quicksort not mergesort', 'complete using recursion instead of iteration', 'fill in this block with a single for loop', 'generate the rest using dynamic programming'.
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

`npx skillsauth add ndpvt-web/arxiv-claude-skills evaluating-achieving-controllable-code`
This skill enables Claude to perform instruction-guided code completion — generating code that not only works correctly but strictly follows user-specified constraints on implementation approach, algorithm choice, data structures, control flow patterns, and code scope. Based on the C3-Bench framework (arXiv:2601.15879), this technique separates two orthogonal concerns: functional correctness (does the code pass tests?) and instruction adherence (does it follow the user's stated constraints?). Most code completion today optimizes only for correctness; this skill ensures Claude also respects how the user wants something implemented.
The C3-Bench paper identifies two independent dimensions of controllable code completion:
Implementation-Control Completion (ICC): The user constrains how code is implemented. This covers four sub-categories: (1) Structural Specifications — requiring specific data structures or class hierarchies, (2) Algorithmic Implementation — mandating a particular algorithm or approach, (3) Control Flow — specifying execution patterns like recursion vs. iteration or loop types, and (4) Critical Parameters — constraining variable names, default values, or configuration choices. The key insight is that for any given function signature and test suite, there are often multiple functionally equivalent implementations that differ in approach. The user's instruction selects which one to produce.
Scale-Control Completion (SCC): The user constrains how much code is generated. This covers three scopes: (1) Line Span — completing a partial line, (2) Multi-line — generating exactly N complete lines, and (3) Statement Block — producing exactly one control structure (a single for-loop, a single if-else, etc.). This prevents the model from over-generating or under-generating relative to the user's intent.
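Both kinds of constraint can often be checked mechanically from the syntax tree. The following is a minimal sketch using Python's standard `ast` module. The helper names (`calls_itself`, `uses_name`, `is_single_block`, `line_count`) are illustrative assumptions and are not part of C3-Bench. Counting only non-blank lines is likewise an assumption about how "exactly N lines" should be read.

```python
import ast
import textwrap


def calls_itself(func_source: str) -> bool:
    """ICC control-flow check: does the first function found call itself by name?"""
    tree = ast.parse(textwrap.dedent(func_source))
    func = next(n for n in ast.walk(tree) if isinstance(n, ast.FunctionDef))
    return any(
        isinstance(n, ast.Call)
        and isinstance(n.func, ast.Name)
        and n.func.id == func.name
        for n in ast.walk(func)
    )


def uses_name(source: str, name: str) -> bool:
    """ICC structural check: is `name` (e.g. 'deque') referenced in the code?"""
    tree = ast.parse(textwrap.dedent(source))
    return any(
        (isinstance(n, ast.Name) and n.id == name)
        or (isinstance(n, ast.Attribute) and n.attr == name)
        for n in ast.walk(tree)
    )


def is_single_block(completion: str, node_type: type) -> bool:
    """SCC statement-block check: exactly one top-level statement of the given type."""
    body = ast.parse(textwrap.dedent(completion)).body
    return len(body) == 1 and isinstance(body[0], node_type)


def line_count(completion: str) -> int:
    """SCC multi-line check: number of non-blank lines (an assumed convention)."""
    return sum(1 for line in completion.splitlines() if line.strip())


# Hypothetical inputs for illustration only.
recursive = """
def fact(n):
    return 1 if n <= 1 else n * fact(n - 1)
"""
loop_only = """
    for item in items:
        if item:
            out.append(item)
"""
print(calls_itself(recursive))              # True
print(is_single_block(loop_only, ast.For))  # True
print(line_count(loop_only))                # 3
```

Syntax-level checks like these catch scope and structure violations, but they cannot confirm that a named algorithm was really implemented. That still needs a read-through of the logic.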
The actionable insight is: treat every code completion as a two-objective problem. First, decompose the user's request into functional requirements (what must the code do?) and control requirements (how must it do it, and how much should be generated?). Then generate code that satisfies both, verifying each independently.
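One way to keep the two objectives separate is to record them as two independent verdicts. The sketch below is an assumption about how that might look. `Verdict`, `evaluate`, and `must_use_recursion` are hypothetical names, and the recursion check is deliberately crude.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Verdict:
    functionally_correct: bool
    follows_instructions: bool

    @property
    def acceptable(self) -> bool:
        # Both objectives must hold; neither compensates for the other.
        return self.functionally_correct and self.follows_instructions


def evaluate(
    source: str,
    func_name: str,
    test_cases: list[tuple[tuple, object]],
    control_checks: list[Callable[[str], bool]],
) -> Verdict:
    namespace: dict = {}
    exec(source, namespace)  # sketch only: run trusted code, never untrusted input
    func = namespace[func_name]
    correct = all(func(*args) == expected for args, expected in test_cases)
    adherent = all(check(source) for check in control_checks)
    return Verdict(correct, adherent)


def must_use_recursion(src: str) -> bool:
    # Crude textual check: "total(" appears once in the def line,
    # so a second occurrence implies a self-call.
    return src.count("total(") > 1


# A correct completion that ignores a "use recursion" instruction.
iterative_sum = """
def total(xs):
    acc = 0
    for x in xs:
        acc += x
    return acc
"""
tests = [(([1, 2, 3],), 6), (([],), 0)]
verdict = evaluate(iterative_sum, "total", tests, [must_use_recursion])
print(verdict)  # Verdict(functionally_correct=True, follows_instructions=False)
```

The example passes every test yet is still not acceptable, which is the case a correctness-only evaluation would miss.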
1. Parse the completion context. Read the surrounding code: the prefix (code before the cursor), the suffix (code after the cursor, if available), the function signature, imports, and any docstrings. Identify what functionality the structural context expects.
2. Extract the user's control instructions. Separate the request into two categories: implementation-control constraints (required data structures, algorithm, control flow, or critical parameters) and scale-control constraints (a partial line, an exact number of lines, or a single statement block).
3. Identify the functionally equivalent alternatives. Before generating, mentally enumerate 2-3 valid approaches that would satisfy the functional requirement. This ensures you understand the choice space the user is constraining. For example, if the task is shortest path, alternatives include Dijkstra's algorithm, Bellman-Ford, SPFA, or Floyd-Warshall.
4. Select the approach matching the user's instruction. From the alternatives identified in step 3, choose the one that matches the user's constraint. If the instruction is ambiguous, ask for clarification rather than guessing.
5. Draft the completion within scope constraints. If the user specified a line count or block type, generate exactly that amount. Do not add extra helper lines, comments, or blank lines that violate the scope. If the user said "one for loop," produce exactly one for loop, not a for loop followed by a separate conditional.
6. Verify functional correctness. Mentally trace or actually run the generated code against the expected behavior. Ensure it handles edge cases the same way the surrounding code implies.
7. Verify instruction adherence independently. After confirming correctness, re-read the user's control instructions and check each one against the generated code.
8. If either check fails, regenerate. Do not ship code that is correct but violates the instruction, or code that follows the instruction but is buggy. Both objectives must be met.
9. Present the completion with a brief explanation of how it satisfies both the functional and control requirements, so the user can verify.
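The generate, verify, and regenerate steps above can be sketched as a small retry loop. This is an illustration under stated assumptions. `complete_with_checks` and its callable parameters are hypothetical, and the "generator" here is a canned list standing in for a model call.

```python
from typing import Callable, Optional


def complete_with_checks(
    generate: Callable[[int], str],
    is_correct: Callable[[str], bool],
    follows_instruction: Callable[[str], bool],
    max_attempts: int = 3,
) -> Optional[str]:
    """Return the first candidate that passes both checks, or None."""
    for attempt in range(max_attempts):
        candidate = generate(attempt)
        # The two checks are evaluated separately; both must pass.
        if is_correct(candidate) and follows_instruction(candidate):
            return candidate
    return None


# Hypothetical task: "return the smallest element using min(), not sorting".
candidates = ["sorted(xs)[0]", "min(xs)"]


def generate(attempt: int) -> str:
    return candidates[attempt % len(candidates)]


def is_correct(expr: str) -> bool:
    return eval(expr, {"xs": [3, 1, 2]}) == 1  # sketch only: trusted input


def follows_instruction(expr: str) -> bool:
    return "sorted" not in expr


result = complete_with_checks(generate, is_correct, follows_instruction)
print(result)  # min(xs)
```

The first candidate is correct but sorts, so it is rejected and the second is returned. If no candidate passes both checks, the loop returns `None` rather than falling back to a correct-but-noncompliant answer.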
Example 1: Algorithm-Constrained Completion
User provides this partial code and asks: "Complete the shortest_path function using the SPFA algorithm with Small Label First (SLF) optimization using a deque."
```python
# Prefix:
from collections import deque

def shortest_path(graph, source, n):
    dist = [float('inf')] * n
    dist[source] = 0
    in_queue = [False] * n
    # CURSOR: complete from here

# Suffix:
    return dist
```
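Approach: Seed a deque with the source, relax outgoing edges SPFA-style, and push an improved vertex to the front of the deque when its distance is smaller than that of the current front (SLF).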
Output:
```python
    q = deque([source])
    in_queue[source] = True
    while q:
        u = q.popleft()
        in_queue[u] = False
        for v, w in graph[u]:
            if dist[u] + w < dist[v]:
                dist[v] = dist[u] + w
                if not in_queue[v]:
                    in_queue[v] = True
                    # SLF: if new dist is small, push to front
                    if q and dist[v] < dist[q[0]]:
                        q.appendleft(v)
                    else:
                        q.append(v)
```
Verification: Uses deque (not heap or list). Implements SLF via the appendleft conditional. SPFA relaxation pattern, not Dijkstra's greedy extraction.
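For the functional half of the check, the prefix, completion, and suffix can be assembled and run on a small input. The graph below is a hypothetical example. The adjacency-list shape (`graph[u]` is a list of `(v, weight)` pairs) is inferred from the `for v, w in graph[u]` loop, not stated in the original request.

```python
from collections import deque


def shortest_path(graph, source, n):
    dist = [float('inf')] * n
    dist[source] = 0
    in_queue = [False] * n
    q = deque([source])
    in_queue[source] = True
    while q:
        u = q.popleft()
        in_queue[u] = False
        for v, w in graph[u]:
            if dist[u] + w < dist[v]:
                dist[v] = dist[u] + w
                if not in_queue[v]:
                    in_queue[v] = True
                    # SLF: a vertex closer than the current front goes to the front
                    if q and dist[v] < dist[q[0]]:
                        q.appendleft(v)
                    else:
                        q.append(v)
    return dist


# Hypothetical 4-node directed graph: graph[u] = [(v, weight), ...]
graph = [
    [(1, 4), (2, 1)],  # 0 -> 1 (4), 0 -> 2 (1)
    [(3, 1)],          # 1 -> 3 (1)
    [(1, 2), (3, 5)],  # 2 -> 1 (2), 2 -> 3 (5)
    [],                # 3 has no outgoing edges
]
distances = shortest_path(graph, 0, 4)
print(distances)  # [0, 3, 1, 4]
```

On this input the SLF branch fires once: vertex 2 (distance 1) is pushed ahead of vertex 1 (distance 4).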
Example 2: Scope-Constrained Completion
User: "Complete the missing part with exactly one for-loop block. No additional statements outside the loop."
```python
def flatten(nested_list):
    result = []
    # CURSOR: complete with exactly one for-loop
    return result
```
Approach: Iterate over nested_list once, recursing into nested lists and appending every other item to result.
Output:
```python
    for item in nested_list:
        if isinstance(item, list):
            result.extend(flatten(item))
        else:
            result.append(item)
```
Verification: Exactly one for-loop block. The if/else is nested inside the for-loop, not an additional top-level statement. No extra lines outside the loop.
Example 3: Data Structure and Control Flow Constraints
User: "Complete this using a min-heap (heapq), not a sorted list. Use a while loop to process, not a for loop."
```python
import heapq

def k_smallest_pairs(nums1, nums2, k):
    """Return k pairs with smallest sums from nums1 and nums2."""
    if not nums1 or not nums2:
        return []
    # CURSOR
```
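Approach: Seed a min-heap with the pairs (nums1[i], nums2[0]), then pop in a while loop, pushing the next pair for the same i after each pop.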
Output:
```python
    heap = []
    for i in range(min(k, len(nums1))):
        heapq.heappush(heap, (nums1[i] + nums2[0], i, 0))
    result = []
    while heap and len(result) < k:
        _, i, j = heapq.heappop(heap)
        result.append([nums1[i], nums2[j]])
        if j + 1 < len(nums2):
            heapq.heappush(heap, (nums1[i] + nums2[j + 1], i, j + 1))
    return result
```
Verification: Uses heapq (not a sorted list). Main processing uses while heap (not a for loop). The initial for-loop is for seeding the heap, not for processing — the user's constraint applies to the processing loop.
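A quick way to check the functional side is to compare the assembled function against a brute-force reference. This sketch assumes that `nums1` and `nums2` are sorted in ascending order. The heap-seeding approach relies on that, although the original prompt does not say so. `brute_force_sums` is a hypothetical helper added for the comparison.

```python
import heapq
from itertools import product


def k_smallest_pairs(nums1, nums2, k):
    """Return k pairs with smallest sums from nums1 and nums2 (both sorted ascending)."""
    if not nums1 or not nums2:
        return []
    heap = []
    for i in range(min(k, len(nums1))):
        heapq.heappush(heap, (nums1[i] + nums2[0], i, 0))
    result = []
    while heap and len(result) < k:
        _, i, j = heapq.heappop(heap)
        result.append([nums1[i], nums2[j]])
        if j + 1 < len(nums2):
            heapq.heappush(heap, (nums1[i] + nums2[j + 1], i, j + 1))
    return result


def brute_force_sums(nums1, nums2, k):
    # Reference: sort every pair sum and keep the k smallest.
    return sorted(a + b for a, b in product(nums1, nums2))[:k]


nums1, nums2 = [1, 7, 11], [2, 4, 6]
pairs = k_smallest_pairs(nums1, nums2, 3)
print(pairs)  # [[1, 2], [1, 4], [1, 6]]
```

Comparing sums rather than the pairs themselves avoids false failures when several pairs tie on the same sum.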
… list used as a stack or collections.deque.
… collections.deque for performance.