skills/benchmarking-abap-code-generation/SKILL.md
Generate syntactically correct and functional ABAP code using iterative compiler feedback loops. Applies the empirical methodology from Wallraven et al. (2026) to produce SAP ABAP classes that pass syntax checks and unit tests through up to 5 rounds of error-driven refinement. Trigger phrases: 'generate ABAP code', 'write ABAP class', 'fix ABAP syntax error', 'ABAP compiler feedback', 'SAP ABAP development', 'iterative ABAP correction'.
Install this skill globally with one command (works with Claude Code, Cursor, and Windsurf):
npx skillsauth add ndpvt-web/arxiv-claude-skills benchmarking-abap-code-generation
This skill enables Claude to generate high-quality SAP ABAP code by applying an iterative compiler feedback loop derived from the empirical benchmark study by Wallraven et al. The core insight: initial ABAP generation succeeds only ~24% of the time, but feeding compiler error messages back into subsequent generation attempts raises success rates to ~75% within 5 rounds. This skill encodes the prompt structure, error classification taxonomy, and iterative correction strategy that proved most effective across 180 benchmark tasks spanning string handling, list operations, mathematical calculations, logical conditions, and SAP database operations.
ABAP is a low-resource, proprietary language deeply embedded in the SAP ecosystem. LLMs see far less ABAP training data than they do for languages like Python or JavaScript, which causes high initial error rates. The paper demonstrates that a structured feedback loop, in which compiler diagnostics are fed back to the LLM for correction, dramatically improves output quality.
The workflow operates in up to 5 refinement rounds. In Round 0, the LLM generates ABAP code from the task description alone. If compilation or unit tests fail, the specific error message from the SAP server is appended to the prompt, and the LLM generates a corrected version. The empirical data shows the largest gains come in the first two rounds (initial ~24% jumps to ~42% after Round 1, ~53% after Round 2), with diminishing but still valuable returns through Round 5. The improvement curve has not fully flattened at 5 rounds, meaning additional iterations could yield further gains.
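The round structure described above can be sketched as a small driver loop. This is a minimal illustration, not the paper's harness: `refine`, `CheckResult`, and the two callables (`generate` for the LLM call, `check` for the SAP-side syntax check and unit tests) are hypothetical names, and the prompt layout is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, Optional

MAX_FEEDBACK_ROUNDS = 5  # cap on error-driven refinements after Round 0


@dataclass
class CheckResult:
    ok: bool
    message: str = ""  # compiler or unit-test diagnostic, empty on success


def refine(task: str,
           generate: Callable[[str], str],
           check: Callable[[str], CheckResult],
           max_rounds: int = MAX_FEEDBACK_ROUNDS) -> tuple[Optional[str], int]:
    """Return (code, round_number), or (None, max_rounds) if every attempt fails."""
    prompt = task
    for round_no in range(max_rounds + 1):  # Round 0 is the initial attempt
        code = generate(prompt)
        result = check(code)
        if result.ok:
            return code, round_no
        # Append the diagnostic so the next attempt can target it.
        prompt = (f"{task}\n\nPrevious attempt:\n{code}"
                  f"\n\nError:\n{result.message}")
    return None, max_rounds
```

Passing the model call and the checker in as callables keeps the loop independent of any particular LLM client or SAP connection, so it can be exercised with stubs before wiring in real services.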
Error types fall into distinct categories that require different correction strategies: class creation errors (structural problems preventing the class from being created), syntax errors (type/conversion errors, declaration errors, lexical errors, structural issues), and unit test failures (code compiles but produces wrong results). The correction approach must differ: syntax errors benefit from targeted fixes to the specific line, while unit test failures often require rethinking the algorithm logic.
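One way to make this taxonomy operational is to classify each feedback report by the earliest phase that failed and look up the matching correction strategy. The sketch below assumes a hypothetical `Feedback` record; the real shape of the SAP server's response is not specified in this document.

```python
from dataclasses import dataclass, field
from enum import Enum


class Phase(Enum):
    CLASS_CREATION = "class creation error"
    SYNTAX = "syntax error"
    UNIT_TEST = "unit test failure"
    PASSED = "passed"


@dataclass
class Feedback:  # hypothetical shape of what the SAP-side check reports
    class_created: bool
    syntax_errors: list[str] = field(default_factory=list)
    failed_tests: list[str] = field(default_factory=list)


STRATEGY = {
    Phase.CLASS_CREATION: "regenerate the full class skeleton",
    Phase.SYNTAX: "fix only the statement the compiler points at",
    Phase.UNIT_TEST: "re-examine the algorithm against the task",
    Phase.PASSED: "no correction needed",
}


def classify(fb: Feedback) -> Phase:
    # Phases are checked in pipeline order: a class that was never created
    # cannot have been syntax-checked or unit-tested.
    if not fb.class_created:
        return Phase.CLASS_CREATION
    if fb.syntax_errors:
        return Phase.SYNTAX
    if fb.failed_tests:
        return Phase.UNIT_TEST
    return Phase.PASSED
```

For example, `STRATEGY[classify(Feedback(True, ["LV_RESULT is not declared"]))]` selects the targeted-fix strategy rather than a full rewrite.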
Constrain the generation environment. Before writing any ABAP, establish the target system version (e.g., NetWeaver 7.57 / S/4HANA 2022), naming conventions (classes starting with Z), and parameter style (use RETURNING parameters, not EXPORTING). These constraints prevent an entire class of structural errors.
Structure the output as a global ABAP class. Generate a complete class with both DEFINITION and IMPLEMENTATION sections. Use a single public static method per class. This mirrors the format that SAP systems expect and avoids the most common class creation failures.
Generate the initial ABAP code (Round 0). Produce the code based solely on the task description. Emulate a low sampling temperature (0.2): favor the most likely correct ABAP constructs over creative alternatives. Output only code, no explanations.
Classify any compiler errors by phase and type. When the user reports errors, determine: (a) Did the class fail to create? (b) Did it fail syntax checking? (c) Did it fail at unit test execution? Then sub-classify syntax errors as declaration errors, lexical errors, type/conversion errors, or structural errors. This classification drives the correction strategy.
Apply targeted corrections based on error category.
'...'), check for invalid characters, ensure correct use of periods as statement terminators.
Preserve working portions of the code. When correcting, change only what the error message indicates is broken. Do not rewrite the entire class, since this risks introducing new errors in previously correct sections.
For unit test failures, re-examine the algorithm. If the code compiles but tests fail, the logic is wrong, not the syntax. Re-read the task requirements, check edge cases (empty inputs, boundary values, type overflows), and verify that ABAP-specific behavior (1-based internal table indexing versus 0-based string offsets, string operations) matches expectations.
Iterate up to 5 rounds. Each round should incorporate the latest error message. If the same error persists after 2 rounds, try a fundamentally different approach rather than making incremental fixes to the same structure.
Validate ABAP-specific constructs. Before finalizing, check: internal table operations use correct syntax (APPEND, READ TABLE, LOOP AT), SELECT statements use proper Open SQL syntax, numeric literals use periods as decimal separators, and method signatures match the expected interface.
Output the final class as a single self-contained code block. Include the complete CLASS ... DEFINITION and CLASS ... IMPLEMENTATION so the user can directly paste it into their SAP system.
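Before handing the class to the user, a cheap structural pre-check can catch the failures the steps above warn about (missing sections, mismatched names, a class name outside the Z namespace). The function below is a rough sketch under stated assumptions: `preflight` is a hypothetical helper, it uses plain regular expressions rather than an ABAP parser, and it does not skip keywords that appear inside comments or string literals.

```python
import re


def preflight(abap_source: str) -> list[str]:
    """Return a list of problems; an empty list means the cheap checks passed."""
    problems = []
    text = abap_source.upper()  # ABAP keywords are case-insensitive
    definition = re.search(r"\bCLASS\s+(\w+)\s+DEFINITION\b", text)
    implementation = re.search(r"\bCLASS\s+(\w+)\s+IMPLEMENTATION\b", text)
    if not definition:
        problems.append("missing CLASS ... DEFINITION")
    if not implementation:
        problems.append("missing CLASS ... IMPLEMENTATION")
    if (definition and implementation
            and definition.group(1) != implementation.group(1)):
        problems.append("class names differ between sections")
    if definition and not definition.group(1).startswith("Z"):
        problems.append("class name does not start with Z")
    # One ENDCLASS closes the definition, one closes the implementation.
    if len(re.findall(r"\bENDCLASS\s*\.", text)) != 2:
        problems.append("expected exactly two ENDCLASS statements")
    return problems
```

A non-empty result is a reason to regenerate before spending a feedback round on the SAP system; an empty result only means these particular checks passed, not that the class will compile.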
Example 1: String Reversal Task
User: Write an ABAP class that reverses a string. The class should be named ZCL_STRING_REVERSE with a static method REVERSE that takes a string and returns the reversed version.
Approach:
Output (Round 0):
CLASS zcl_string_reverse DEFINITION PUBLIC FINAL CREATE PUBLIC.
  PUBLIC SECTION.
    CLASS-METHODS reverse
      IMPORTING iv_input TYPE string
      RETURNING VALUE(rv_result) TYPE string.
ENDCLASS.

CLASS zcl_string_reverse IMPLEMENTATION.
  METHOD reverse.
    DATA(lv_len) = strlen( iv_input ).
    DATA(lv_index) = lv_len - 1.
    WHILE lv_index >= 0.
      rv_result = rv_result && iv_input+lv_index(1).
      lv_index = lv_index - 1.
    ENDWHILE.
  ENDMETHOD.
ENDCLASS.
Example 2: Iterative Correction from Compiler Error
User: I tried your ABAP code and got this error: "LV_RESULT" is not declared. Use a DATA statement.
Approach:
DATA(...) require certain system versions; explicit DATA statement is safer
Corrected output (Round 1):
METHOD calculate_sum.
  " Explicit declarations avoid relying on inline DATA(...) support.
  DATA lv_result TYPE i.
  DATA lv_index TYPE i.
  DATA lv_num TYPE i.
  lv_index = 0.
  WHILE lv_index < lines( it_numbers ).
    lv_index = lv_index + 1. " internal tables are 1-based
    READ TABLE it_numbers INDEX lv_index INTO lv_num.
    lv_result = lv_result + lv_num.
  ENDWHILE.
  rv_sum = lv_result.
ENDMETHOD.
Example 3: SAP Database Operation Task
User: Write an ABAP class that reads all entries from table SFLIGHT where the carrier ID is 'LH' and returns the total price as a decimal.
Approach:
Output:
CLASS zcl_flight_price DEFINITION PUBLIC FINAL CREATE PUBLIC.
  PUBLIC SECTION.
    CLASS-METHODS get_total_price
      IMPORTING iv_carrid TYPE s_carr_id
      RETURNING VALUE(rv_total) TYPE s_price.
ENDCLASS.

CLASS zcl_flight_price IMPLEMENTATION.
  METHOD get_total_price.
    SELECT SUM( price ) FROM sflight
      INTO rv_total
      WHERE carrid = iv_carrid.
  ENDMETHOD.
ENDCLASS.
Example 4: Unit Test Failure Correction
User: The code compiles fine but the unit test fails. Expected output for input [3, 1, 2] is [1, 2, 3] but I'm getting [3, 2, 1].
Approach:
Correction: Reverse the comparison operator in the sorting condition from > to < (or swap the ascending/descending flag), keeping all declarations and class structure intact.
Always include both DEFINITION and IMPLEMENTATION sections; partial code causes class creation failures, the most catastrophic error type.
Use RETURNING VALUE(...) for method output parameters unless the user explicitly requires EXPORTING. This is the modern ABAP convention and avoids a common structural error.
Prefer explicit DATA declarations over inline DATA(...) when targeting older NetWeaver systems. Ask the user about their system version if uncertain.
Class creation failures (most severe): The entire class structure is malformed. Common causes: missing ENDCLASS, wrong section ordering, invalid class name. Response: regenerate the full class skeleton from scratch rather than patching.
Syntax errors after successful class creation: The class structure is valid but individual statements are wrong. Use the compiler's line number and error code to make surgical fixes. The sub-types to look for are type/conversion errors, declaration errors, lexical errors, and structural issues.
Unit test failures with clean compilation: The hardest category to fix because the error is logical, not syntactic. Re-read the task specification carefully, check boundary conditions, and verify that ABAP's specific behavior (1-based table indexing, string offset handling) matches expectations.
Persistent errors after 2+ rounds: If the same error recurs, the approach is fundamentally flawed. Rewrite the method body using a different algorithm rather than continuing to patch the same code.
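The switch from patching to rewriting can be decided mechanically from the history of diagnostics. The sketch below is one possible heuristic, not something the paper prescribes: `next_action` and `normalize` are hypothetical helpers, and treating messages as "the same" after masking digits and case is an assumption made so that a shifted line number does not hide a recurring error.

```python
import re


def normalize(message: str) -> str:
    # Mask numbers (for example line numbers) and case so that the same
    # diagnostic reported at a different line still compares equal.
    return re.sub(r"\d+", "#", message.strip().lower())


def next_action(error_history: list[str], patience: int = 2) -> str:
    """Return "patch" or "rewrite" given diagnostics so far, oldest first."""
    if len(error_history) < patience:
        return "patch"
    recent = {normalize(e) for e in error_history[-patience:]}
    # A single distinct message across the last `patience` rounds means
    # incremental fixes are not making progress.
    return "rewrite" if len(recent) == 1 else "patch"
```

With the default `patience=2`, two consecutive rounds ending in the same normalized message trigger a rewrite of the method body; any change in the diagnostic resets the decision to patching.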
Wallraven, S., Köhne, T., Westenberger, H., & Moser, A. (2026). Benchmarking Large Language Models for ABAP Code Generation: An Empirical Study on Iterative Improvement by Compiler Feedback. arXiv:2601.15188v1. Key takeaway: structured compiler feedback loops with up to 5 iterations raise ABAP code generation success from ~24% to ~75%, with the largest gains in the first two rounds.