skills/batcoder-self-supervised-bidirectional-code-docume/SKILL.md
Apply BatCoder's back-translation technique to improve code and documentation quality bidirectionally. Generate documentation from code, then verify it by reconstructing code from that documentation -- using cycle consistency as a quality signal. Triggers: 'generate documentation from code', 'back-translate code and docs', 'verify documentation quality', 'improve code with back-translation', 'self-supervised code documentation', 'cycle-consistent code generation'
npx skillsauth add ndpvt-web/arxiv-claude-skills batcoder-self-supervised-bidirectional-code-docume
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
This skill enables Claude to apply the BatCoder back-translation technique for jointly improving code generation and documentation production. The core idea: generate documentation from code, then reconstruct the original code from that documentation alone. If the reconstructed code faithfully matches the original, the documentation is semantically complete. If it diverges, the documentation has gaps. This cycle-consistency signal drives iterative improvement of both code clarity and documentation accuracy without requiring curated training pairs.
BatCoder introduces a back-translation cycle borrowed from machine translation. In natural language translation, back-translation generates synthetic training data by translating target-language text back to the source language. BatCoder applies this to code: given a code snippet c, first generate documentation d = f(c), then reconstruct code c' = g(d). The semantic similarity between c and c' -- measured via structural and semantic comparison -- serves as an implicit quality signal for both the documentation and the code generation process.
The key insight is that documentation quality can be measured indirectly through code reconstruction fidelity. If generated documentation captures all the logic, control flow, edge cases, and type constraints of the original code, then an independent code generation pass from that documentation should produce functionally equivalent code. Gaps in the documentation surface as divergences in the reconstructed code. This eliminates the need for human-annotated code-documentation pairs.
The reward combines two components: (1) a structural validity check on the documentation (does it contain required elements like description, parameter types, examples?), scored as R_doc in {0, 0.5, 1}, and (2) a semantic similarity score R_sim = S(c, c') between original and reconstructed code, computed using program dependence graph comparison (CSSG metric, range [0,1]). The combined reward R = R_sim * R_doc drives iterative refinement; it lies in [0,1] and is zero whenever the documentation fails the structural check. Because it only requires raw code, not paired data, this approach scales with corpus size.
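The reward formula can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the CSSG metric is not reproduced here, so `stand_in_similarity` (a hypothetical helper) uses plain text similarity from Python's `difflib` as a placeholder for S(c, c'), and `combined_reward` is likewise an illustrative name.

```python
import difflib

# R_doc is restricted to the three levels described above.
VALID_R_DOC = (0.0, 0.5, 1.0)


def stand_in_similarity(original: str, reconstructed: str) -> float:
    """Text similarity in [0, 1]. A placeholder only, NOT the CSSG metric."""
    return difflib.SequenceMatcher(None, original, reconstructed).ratio()


def combined_reward(original: str, reconstructed: str, r_doc: float) -> float:
    """Compute R = R_sim * R_doc for one back-translation pass."""
    if r_doc not in VALID_R_DOC:
        raise ValueError(f"r_doc must be one of {VALID_R_DOC}, got {r_doc}")
    r_sim = stand_in_similarity(original, reconstructed)
    return r_sim * r_doc


code = "def add(a, b):\n    return a + b\n"
changed = "def add(a, b):\n    return a - b\n"

print(combined_reward(code, code, 1.0))     # identical code, complete docs
print(combined_reward(code, code, 0.5))     # identical code, partial docs
print(combined_reward(code, changed, 0.0))  # structurally invalid docs
```

Because the two factors are multiplied, a perfect reconstruction cannot compensate for structurally invalid documentation: any pass with R_doc = 0 scores zero regardless of similarity.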
1. Accept the input code snippet. Read the function or module the user wants documented. Identify the programming language, function signature, parameter types, return types, and any imports or dependencies.
2. Generate candidate documentation (Code-to-Doc). Produce a detailed docstring or specification from the code. Include: a one-line summary, detailed description of behavior, parameter descriptions with types, return value description, edge cases, and at least one input/output example using the language's doctest convention (e.g., >>> for Python, // examples for Go).
3. Validate documentation structure. Check that the generated documentation contains all required structural components: description, typed parameters, return specification, and usage examples. Assign a structural validity score: 1.0 if all are present, 0.5 if only some are present, 0.0 if critical elements are missing. Reject and regenerate if the score is 0.
4. Reconstruct code from documentation alone (Doc-to-Code). Using ONLY the generated documentation as input (not the original code), write a fresh implementation. Do not reference the original code during this step -- treat the documentation as the sole specification.
5. Compare original and reconstructed code for semantic equivalence. Analyze both implementations across the dimensions the documentation is expected to capture: logic, control flow, edge cases, and type constraints.
6. Identify semantic gaps. Where the reconstructed code diverges from the original, trace the divergence back to missing or ambiguous information in the documentation. List each gap explicitly.
7. Refine the documentation. Update the documentation to close each identified gap. Add missing edge case descriptions, clarify ambiguous parameter constraints, and add examples that cover the divergent behavior.
8. Optionally refine the code. If the back-translation reveals that the original code has unclear logic (e.g., magic numbers, implicit assumptions, convoluted control flow), suggest code improvements that make the intent more explicit without changing behavior.
9. Re-run the cycle if needed. For complex functions, repeat steps 4-7 once more to verify that the refined documentation now produces a faithful reconstruction.
10. Deliver the final documentation and any code suggestions. Present the validated documentation alongside a summary of what the back-translation cycle revealed and fixed.
Example 1: Python function with subtle edge cases
User: "Generate documentation for this function."
def merge_intervals(intervals):
    if not intervals:
        return []
    intervals.sort(key=lambda x: x[0])
    merged = [intervals[0]]
    for current in intervals[1:]:
        if current[0] <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], current[1]))
        else:
            merged.append(current)
    return merged
Approach:
[...] <= not <), tuple reconstruction, and empty-input handling.
[...] (1,2) and (2,3) -- the <= matters), overlapping intervals, already-sorted input, tuples vs lists.
[...] end == start) also merge. Reconstruction used < instead of <=.
Output:
def merge_intervals(intervals):
    """Merge overlapping or touching intervals into non-overlapping intervals.

    Takes a list of (start, end) tuples and returns a new list where all
    overlapping or adjacent intervals are combined. Two intervals are merged
    if one starts at or before the other ends (i.e., touching intervals like
    (1, 2) and (2, 3) become (1, 3)).

    The input list is sorted in-place by start value before merging.

    Args:
        intervals: List of (start, end) tuples. Can be empty.

    Returns:
        List of merged (start, end) tuples in sorted order.
        Returns [] for empty input.

    Examples:
        >>> merge_intervals([(1, 3), (2, 6), (8, 10)])
        [(1, 6), (8, 10)]
        >>> merge_intervals([(1, 2), (2, 3)])
        [(1, 3)]
        >>> merge_intervals([])
        []
    """
Example 2: Go function -- back-translation reveals ambiguity
User: "Document this Go function and verify the docs are accurate."
func Deduplicate(items []string) []string {
	seen := make(map[string]bool)
	result := []string{}
	for _, item := range items {
		if !seen[strings.ToLower(item)] {
			seen[strings.ToLower(item)] = true
			result = append(result, item)
		}
	}
	return result
}
Approach:
Output:
// Deduplicate removes case-insensitive duplicate strings from the input slice,
// preserving the original casing of the first occurrence of each unique string.
// Order of first appearances is maintained.
//
// Comparison is performed by lowercasing each string via strings.ToLower.
// For input ["Hello", "HELLO", "world"], returns ["Hello", "world"].
//
// Returns an empty slice (not nil) for empty input.
func Deduplicate(items []string) []string {
Example 3: Validating existing documentation
User: "Check if this docstring actually matches what the code does."
import time

def retry(func, max_attempts=3, delay=1.0):
    """Retry a function up to max_attempts times with a fixed delay between attempts.

    Args:
        func: Callable to retry.
        max_attempts: Maximum number of retries (default 3).
        delay: Seconds to wait between retries (default 1.0).

    Returns:
        The return value of func on success.

    Raises:
        Exception: The last exception if all attempts fail.
    """
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            time.sleep(delay * (2 ** attempt))
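One way to test the docstring's "fixed delay" claim against the code above is to record the durations actually passed to `time.sleep`. The sketch below assumes nothing beyond the standard library; `always_fails` and `observed_delays` are hypothetical helper names, and `retry` is repeated (without its docstring) only so the block runs on its own.

```python
import time
from unittest import mock


def retry(func, max_attempts=3, delay=1.0):
    # Same logic as the function under review, reproduced for a standalone run.
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            time.sleep(delay * (2 ** attempt))


def always_fails():
    raise RuntimeError("boom")


def observed_delays(max_attempts=3, delay=1.0):
    """Return the sleep durations retry() requests when every attempt fails."""
    # Patch time.sleep so the check records durations instead of waiting.
    with mock.patch("time.sleep") as fake_sleep:
        try:
            retry(always_fails, max_attempts=max_attempts, delay=delay)
        except RuntimeError:
            pass  # expected: the last failure propagates
    return [call.args[0] for call in fake_sleep.call_args_list]


delays = observed_delays()
print(delays)                 # [1.0, 2.0]
print(len(set(delays)) == 1)  # False: the delay is not fixed
```

A fixed delay would produce identical entries; the recorded values double on each attempt, which is the mismatch between the docstring and the code.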
Approach:
[...] time.sleep(delay) each time.
[...] delay * (2 ** attempt) -- exponential backoff, not fixed delay.
[...] delay seconds, doubling each attempt") or fix the code (change to actual fixed delay).
[...] < vs <=, range(n) vs range(n+1)) -- these are the most common gaps that back-translation reveals.
[...] >>> examples, JSDoc for JavaScript, GoDoc comments for Go) to maximize structural validity.
Paper: BatCoder: Self-Supervised Bidirectional Code-Documentation Learning via Back-Translation (Xu et al., 2026). Key sections: Section 3 for the back-translation framework and reward formulation (R = R_sim * R_doc), Section 4 for ablation results showing Stage 1 (code-to-doc) is critical, and Table 1 for benchmark comparisons demonstrating 83.5% pass@1 on HumanEval.