skills/automated-customization-enterprise-code/SKILL.md
Customize LLMs for enterprise code repositories using semantic scopes -- automatically partition codebases into meaningful units (function bodies, conditionals, loops, logging patterns) and generate fine-tuning data pairs for repository-specific code completion. Use when: 'customize a model for our codebase', 'generate fine-tuning data from our repo', 'improve code completion for private code', 'extract semantic scopes from source files', 'build training pairs for code LLM', 'set up RAG for enterprise code completion'.
npx skillsauth add ndpvt-web/arxiv-claude-skills automated-customization-enterprise-code
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
This skill enables Claude to apply the Semantic Scopes technique for automated LLM customization to enterprise code repositories. Rather than treating a codebase as flat text, the approach identifies semantically meaningful code units -- function bodies, loop bodies, conditional blocks, logging patterns, function calls -- and uses their boundaries to generate high-quality fine-tuning data pairs and RAG contexts. The result: moderately sized models fine-tuned with this method outperform much larger uncustomized models on repository-specific code completion.
Semantic scopes are recursive, logically cohesive units of code whose boundaries typically coincide with syntactic scopes in languages that use block delimiters. In Java and C/C++, scope candidates include: function/method definitions, for loop bodies, while loop bodies, if/else conditional bodies, and code between matching brackets or parentheses. The insight is that these scopes represent the natural units developers write and read -- completing an entire scope (not just the next token) is what developers actually need from code assistants.
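As a rough illustration of how scope candidates with byte offsets might be collected, the sketch below matches curly braces in a Java- or C-like source buffer. It is a simplified stand-in for a real parser: it does not skip braces inside string literals or comments, and the names `Scope` and `find_brace_scopes` are hypothetical, not part of the original method.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Scope:
    start: int                   # byte offset of the opening brace
    end: int                     # byte offset one past the closing brace
    depth: int                   # nesting depth (0 = outermost)
    parent_start: Optional[int]  # start offset of the enclosing scope, if any


def find_brace_scopes(source: bytes) -> List[Scope]:
    """Return every closed {...} block in `source`, ordered by opening brace."""
    scopes: List[Scope] = []
    stack: List[Scope] = []  # currently open blocks, innermost last
    for offset, byte in enumerate(source):
        if byte == ord("{"):
            parent = stack[-1].start if stack else None
            scope = Scope(offset, -1, len(stack), parent)
            scopes.append(scope)
            stack.append(scope)
        elif byte == ord("}") and stack:
            stack.pop().end = offset + 1
    # Assumption: blocks that never close (malformed input) are dropped.
    return [s for s in scopes if s.end != -1]
```

Because scopes are recursive, each nested block is reported alongside its enclosing block; slicing `source[s.start:s.end]` recovers the text of a candidate.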
The pipeline has three phases: Data Selection (collect files by language, store metadata), Scope Identification (parse each file to extract scope candidates with byte-offset boundaries, then filter by size: 50-1,000 bytes for the scope, minimum 200 bytes of preceding context), and Pair Generation (create query-label pairs where the query is the code prefix up to the scope start, and the label is the complete scope text terminated with an end-of-text token). To improve robustness, the system generates additional pairs starting at random byte offsets within the preceding context, not only at the exact scope boundary.
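The size filter and pair generation described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the `n_augmented` and `seed` parameters, and the literal `<|endoftext|>` token are hypothetical, and for augmented pairs the label is taken to run from the random cut through the end of the scope.

```python
import random
from typing import Dict, List

MIN_SCOPE_BYTES = 50
MAX_SCOPE_BYTES = 1000
MIN_CONTEXT_BYTES = 200
EOT = "<|endoftext|>"  # assumed end-of-text token; use your tokenizer's own


def keep_scope(start: int, end: int) -> bool:
    """Size filter: 50-1,000 byte scope with at least 200 bytes before it."""
    size_ok = MIN_SCOPE_BYTES <= end - start <= MAX_SCOPE_BYTES
    return size_ok and start >= MIN_CONTEXT_BYTES


def make_pairs(source: bytes, start: int, end: int,
               n_augmented: int = 2, seed: int = 0) -> List[Dict[str, str]]:
    """Build one primary pair plus `n_augmented` pairs cut at earlier offsets."""
    if not keep_scope(start, end):
        return []
    rng = random.Random(seed)
    # The primary cut sits at the scope boundary; augmented cuts fall at
    # random offsets inside the 200-byte window that precedes the scope.
    cuts = [start] + [
        rng.randrange(start - MIN_CONTEXT_BYTES + 1, start)
        for _ in range(n_augmented)
    ]
    pairs = []
    for cut in cuts:
        # A byte cut can split a multi-byte character; errors="replace"
        # keeps decoding from failing in that case.
        pairs.append({
            "prefix": source[:cut].decode("utf-8", errors="replace"),
            "completion": source[cut:end].decode("utf-8", errors="replace") + EOT,
        })
    return pairs
```

In a real pipeline the prefix would also be truncated to the model's context window before tokenization.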
For RAG-based customization, code prefixes are embedded using a sentence transformer (e.g., all-MiniLM-L6-v2), stored in a vector index, and the top-k nearest neighbors (typically 3-5) are prepended as context before the completion query. Fine-tuning uses standard supervised training with the Adam optimizer. A practical finding: offloading the optimizer to CPU in 32-bit float arithmetic reduces gradient norm spikes during training.
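The retrieval step can be illustrated with a dependency-free sketch. Note the loud assumption: `toy_embed` is a bag-of-tokens count vector used only so the example runs on its own; it is not a sentence embedding, and a real setup would substitute a model such as all-MiniLM-L6-v2 and a vector index. The function names are hypothetical.

```python
import math
import re
from collections import Counter
from typing import List, Tuple


def toy_embed(text: str) -> Counter:
    """Token-count vector. A stand-in for a real sentence embedding."""
    return Counter(re.findall(r"\w+", text))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query: str, corpus: List[str], k: int = 3) -> List[Tuple[float, str]]:
    """Return the k corpus entries most similar to the query, best first."""
    q = toy_embed(query)
    scored = [(cosine(q, toy_embed(doc)), doc) for doc in corpus]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]
```

The retrieved entries would then be prepended to the completion query, as described above.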
Inventory the repository. Recursively scan the target codebase and classify files by programming language. Store file paths, sizes, and language tags in a JSON manifest. Exclude vendored code, generated files, and test fixtures unless explicitly requested.
Parse files and extract scope candidates. For each source file, use a parser (tree-sitter recommended) to identify all syntactic blocks: function/method bodies, for bodies, while bodies, if bodies, else bodies, and any nested scoped blocks. Record each scope's start byte offset, end byte offset, scope type, and parent scope.
Filter scopes by size constraints. Retain only scopes between 50 and 1,000 bytes in length. Require at least 200 bytes of preceding code in the same file before the scope start. Discard trivially small scopes (single-line returns, empty blocks) and overly large ones (entire class definitions spanning thousands of lines).
Generate primary training pairs. For each retained scope, create a pair: the query is the file content from the start (or a context window of ~2,048 tokens) up to the scope's start byte; the label is the scope content plus an end-of-text token. This teaches the model to complete the exact scope when positioned at its boundary.
Generate augmented training pairs. For each primary pair, create 1-3 additional pairs by selecting random byte offsets within the 200+ bytes preceding the scope start. For each such pair, the query ends at the chosen offset and the label is the content from that offset through the end of the scope. This keeps the model from learning to complete only at exact scope boundaries.
Build the fine-tuning dataset. Combine all pairs, shuffle, and split into train/validation sets (e.g., 95/5). Target 200,000-300,000 pairs for repositories of 10,000+ files. Export in JSONL format with fields: prefix, completion, scope_type, file_path, language.
Configure and run fine-tuning. Use the Hugging Face Trainer with the Adam optimizer, offloading the optimizer to CPU in 32-bit float arithmetic for stability. Train for 3-10 epochs with a learning rate around 1e-5. Monitor gradient norms: spikes indicate the need for CPU optimizer offloading or gradient clipping.
Build the RAG index (optional, complementary). Embed all training pair prefixes using all-MiniLM-L6-v2 or a code-specific embedding model. Store embeddings in a vector database (FAISS, Chroma, or similar). At inference time, embed the current code prefix, retrieve the top 3-5 nearest neighbors, and prepend them as context before the completion query.
Create evaluation test cases. Manually curate 15-40 test cases per scope category (func_body, if_body, for_body, else_body, logging, func_call). For each, record the prefix and the ground-truth scope completion.
Evaluate with Levenshtein distance. Compute two metrics for each prediction: Full (edit distance between the complete prediction and ground truth) and Opt (minimum edit distance across all prefixes of the prediction, which forgives over-generation). Compare base model vs. fine-tuned model vs. RAG-augmented model across all scope categories.
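The two metrics in the evaluation step can be computed in a single dynamic-programming pass, since each row of the Levenshtein table already holds the distance for one prefix of the prediction. The sketch below is one possible implementation; the function name is hypothetical, and it assumes the empty prefix counts as a candidate for Opt.

```python
from typing import Tuple


def full_and_opt(prediction: str, truth: str) -> Tuple[int, int]:
    """Return (Full, Opt) edit distances between a prediction and the truth.

    Full: distance between the whole prediction and the truth.
    Opt:  minimum distance over all prefixes of the prediction.
    """
    # row[j] = distance between the current prediction prefix and truth[:j].
    row = list(range(len(truth) + 1))
    best = row[-1]  # distance for the empty prediction prefix
    for i, p_char in enumerate(prediction, start=1):
        prev_diag, row[0] = row[0], i
        for j, t_char in enumerate(truth, start=1):
            cost = 0 if p_char == t_char else 1
            prev_diag, row[j] = row[j], min(
                row[j] + 1,        # delete p_char from the prediction
                row[j - 1] + 1,    # insert t_char into the prediction
                prev_diag + cost,  # match or substitute
            )
        best = min(best, row[-1])
    return row[-1], best
```

For a prediction that reproduces the ground truth and then keeps generating, Full grows with the extra text while Opt stays at zero, which is the over-generation forgiveness described above.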
Example 1: Generating fine-tuning data from a Java repository
User: "I have a Java microservices repo at ./services/ with about 15K files. Help me generate fine-tuning data for code completion."
Approach:
Scan ./services/ recursively, collecting all .java files into a manifest.
Output (sample pair from the JSONL):
```json
{
  "prefix": "package com.acme.orders;\n\nimport com.acme.logging.AcmeLogger;\nimport com.acme.models.Order;\n\npublic class OrderService {\n private final AcmeLogger logger;\n \n public OrderStatus processOrder(Order order) {\n logger.info(\"Processing order\", order.getId());\n ",
  "completion": "if (order.getItems().isEmpty()) {\n logger.warn(\"Empty order received\", order.getId());\n return OrderStatus.REJECTED;\n }<|endoftext|>",
  "scope_type": "if_body",
  "file_path": "services/order-service/src/main/java/com/acme/orders/OrderService.java",
  "language": "java"
}
```
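Records shaped like the sample pair can be shuffled, split, and written to JSONL with a few lines of standard-library Python. This is a sketch under assumptions: the function name, the output file names `train.jsonl` and `validation.jsonl`, and the default 5% validation share are illustrative choices, not prescribed by the source.

```python
import json
import random
from pathlib import Path
from typing import Dict, List, Tuple

FIELDS = ("prefix", "completion", "scope_type", "file_path", "language")


def split_and_write(pairs: List[Dict[str, str]], out_dir: str,
                    val_fraction: float = 0.05, seed: int = 0) -> Tuple[int, int]:
    """Shuffle pairs, split train/validation, write one JSON object per line.

    Returns (number of training rows, number of validation rows).
    """
    for pair in pairs:
        missing = [f for f in FIELDS if f not in pair]
        if missing:
            raise ValueError(f"pair is missing fields: {missing}")
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    # Keep at least one validation row whenever there is more than one pair.
    n_val = max(1, round(len(shuffled) * val_fraction)) if len(shuffled) > 1 else 0
    splits = {"validation.jsonl": shuffled[:n_val], "train.jsonl": shuffled[n_val:]}
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for name, rows in splits.items():
        with open(out / name, "w", encoding="utf-8") as fh:
            for row in rows:
                fh.write(json.dumps(row, ensure_ascii=False) + "\n")
    return len(splits["train.jsonl"]), n_val
```

Fixing the seed keeps the split reproducible, so later evaluation runs compare against the same validation rows.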
Example 2: Setting up RAG for C++ code completion
User: "Our C++ codebase uses custom logging macros like PD_LOG_BEGIN/PD_LOG_END. Models always get these wrong. Can we fix this with RAG?"
Approach:
Extract logging-category scopes containing the custom macros.
Embed them with all-MiniLM-L6-v2 and index in FAISS.
Output (RAG-augmented prompt sent to the model):
```cpp
// Retrieved context example 1:
PD_LOG_BEGIN(logger, INFO)
    << "Connection established: " << conn.getId()
    << " endpoint=" << conn.getEndpoint()
PD_LOG_END
// Retrieved context example 2:
PD_LOG_BEGIN(logger, WARN)
    << "Retry attempt " << retryCount
    << " for request=" << req.getId()
PD_LOG_END
// Current code to complete:
void handleTimeout(const Request& req, int elapsed) {
    if (elapsed > threshold) {
        PD_LOG_BEGIN(logger, ERROR)
```
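A prompt in the layout shown above could be assembled roughly as follows. The function name is hypothetical, and the comment headers simply mirror the ones in the sample prompt; a different language would need its own comment syntax.

```python
from typing import List


def build_rag_prompt(retrieved: List[str], current_prefix: str) -> str:
    """Prepend retrieved snippets, each under a comment header, to the prefix."""
    parts = []
    for i, snippet in enumerate(retrieved, start=1):
        parts.append(f"// Retrieved context example {i}:\n{snippet.rstrip()}\n")
    # The code to complete goes last so the model continues from its end.
    parts.append(f"// Current code to complete:\n{current_prefix}")
    return "".join(parts)
```

The prefix is left unmodified at the end of the prompt, so trailing whitespace or an open macro call is preserved exactly as the developer typed it.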
Example 3: Evaluating fine-tuned model improvement
User: "I fine-tuned Granite-8B on our repo for 5 epochs. How do I measure if it actually improved?"
Approach:
Curate test cases for each scope category: func_body, if_body, for_body, else_body, logging, func_call.
Output (evaluation summary table):
| Category | N | Base (Opt) | FT (Opt) | Improvement |
|-----------|----|------------|----------|-------------|
| func_body | 38 | 156 | 72 | 53.8% |
| if_body | 30 | 112 | 48 | 57.1% |
| for_body | 16 | 89 | 41 | 53.9% |
| else_body | 20 | 95 | 52 | 45.3% |
| logging | 36 | 347 | 129 | 62.8% |
| func_call | 80 | 84 | 42 | 50.0% |
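
The Improvement column is consistent with the relative reduction in edit distance from the base model to the fine-tuned one, for example (156 - 72) / 156 ≈ 53.8% for func_body. The source does not state the formula explicitly, so the helper below is an inferred sketch with a hypothetical name.

```python
def improvement_pct(base: float, tuned: float) -> float:
    """Relative reduction in edit distance, in percent (positive = tuned is better)."""
    if base <= 0:
        raise ValueError("base distance must be positive")
    return round(100.0 * (base - tuned) / base, 1)
```

Since lower Levenshtein distance is better, a negative result would mean fine-tuning made that category worse.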
Keep the scope_type label in your training metadata so you can evaluate per-category performance and diagnose which patterns the model struggles with.
For code retrieval, consider codebert-base or unixcoder-base instead of general-purpose sentence transformers.
Finkler et al., 2026. "Automated Customization of LLMs for Enterprise Code Repositories Using Semantic Scopes." arXiv:2602.05780 -- Focus on Section 3 (Semantic Scope definition and extraction), Section 4 (training pair generation pipeline), and Table 2 (per-category evaluation results showing fine-tuning improvements).