skills/skillxiv-v0.0.2-claude-opus-4.6/enterprise-tool-calling-finetuning/SKILL.md
Train LLMs to disambiguate tool calls in enterprise settings where multiple similar APIs exist and parameters are incomplete. Generates synthetic multi-turn dialogues with realistic ambiguity to improve tool selection accuracy by 27+ percentage points.
npx skillsauth add ADu2021/skillXiv enterprise-tool-calling-finetuningInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
In enterprise environments, a single business query often maps to multiple near-duplicate APIs, and most real calls arrive missing required parameters. Current LLMs trained on clean, fully-specified tool-calling benchmarks fail dramatically when deployed against production APIs. DiaFORGE solves this mismatch by generating realistic, ambiguous multi-turn dialogues where the agent must disambiguate between competing tools and ask clarifying questions—turning a brittle single-turn problem into a robust interactive one.
The gap between benchmark performance and production reality is massive: static benchmarks use fully-specified queries with one obvious tool, while real APIs have overlapping functionality, required parameters, and user confusion. DiaFORGE bridges this gap through synthetic dialogue generation that mimics real deployment dynamics.
The framework operates on the insight that realistic tool-calling is fundamentally interactive. Rather than expecting agents to infer all details from a single query, train them to:
This interactive approach reduces both false positives (wrong tool calls) and false negatives (refusing to act).
Generate synthetic dialogues using the UTC-Gen multi-agent system. The engine seeds conversations with a ground-truth tool and includes semantic "distractor" tools:
from diaforge.generator import UTCGen
gen = UTCGen(api_catalog="enterprise_apis.json")
# Seed dialogue with ground-truth tool and distractors
dialogue = gen.generate_dialogue(
ground_truth_tool="CRM.CreateLead",
# Similar tools to create disambiguation need
distractors=[
"CRM.UpdateLead",
"CRM.CreateContact",
"Sales.CreateOpportunity"
],
# Define what information is revealed per turn
information_stages=[
{"turn": 1, "revealed": ["customer_name"]},
{"turn": 2, "revealed": ["email", "phone"]},
{"turn": 3, "revealed": ["industry", "company_size"]}
]
)
# Output: multi-turn conversation where assistant must ask
# clarifying questions to determine correct tool
print(dialogue)
# User: "Add John to our system"
# Assistant: "Is John a prospect or an existing customer?"
# User: "He's a new prospect we want to track"
# Assistant: "I'll create him as a new lead. What's his email?"
Fine-tune an open-source model (like Llama) on the generated dialogues. Use supervised learning with loss masking to focus on assistant responses:
from diaforge.training import ToolCallingTrainer
trainer = ToolCallingTrainer(model="meta-llama/Llama-2-7b-chat")
# Load generated dialogues
dialogues = load_synthetic_data("generated_dialogues.jsonl")
# Fine-tune focusing only on assistant turns
trainer.train(
dialogues=dialogues,
loss_mask="assistant_only", # Only optimize assistant responses
epochs=3,
batch_size=32,
learning_rate=2e-5
)
Evaluate models using dynamic evaluation where they interact with a user simulator:
from diaforge.evaluation import DynamicEvaluator
evaluator = DynamicEvaluator(
user_simulator="interactive", # Simulated user responds to agent questions
api_catalog="enterprise_apis.json"
)
# Full interactive rollout: agent picks tool, gets feedback
results = evaluator.evaluate(
model=trained_model,
test_scenarios=test_dialogues,
metrics={
"tool_accuracy": "Did agent pick the right tool?",
"false_positives": "Any incorrect tool calls?",
"abstention_rate": "When does agent refuse to act?",
"success_rate": "Did tool call complete the user's goal?"
}
)
print(f"Tool accuracy: {results['tool_accuracy']:.1%}")
print(f"Success rate: {results['success_rate']:.1%}")
Use this approach when:
Avoid this for:
| Scenario | Distractors | Difficulty | |----------|------------|-----------| | Simple disambiguation | 2-3 similar tools | Baseline | | Semantic overlap | 4-5 tools with overlapping capabilities | Medium | | Parameter interdependence | Tools where required params depend on previous choices | Hard | | Multi-step resolution | Requires 3+ clarifying questions to fully specify | Expert |
Generate ~5,000 dialogues mixing all difficulty levels for robust coverage.
The paper reports DiaFORGE-trained models achieve 27 percentage points improvement over GPT-4o and 49 points over Claude-3.5-Sonnet on dynamic evaluation.
"Disambiguation-Centric Finetuning Makes Enterprise Tool-Calling LLMs More Realistic and Less Risky" - arXiv:2507.03336
testing
Uses flow maps as look-ahead operators to enable principled reward-guided diffusion by predicting trajectory endpoints at any denoising step. Deploy when applying rewards or preferences to diffusion trajectories with meaningful gradients throughout generation.
testing
Train language models where each expert learns independently on closed datasets, enabling flexible inference with selective data inclusion or exclusion. 41% performance improvement while allowing users to opt out of specific data sources without retraining.
data-ai
Understand how token generation flexibility in diffusion LMs paradoxically constrains reasoning, as models exploit ordering flexibility to avoid uncertain tokens, and apply simplified approaches that preserve parallel decoding benefits. Use when optimizing diffusion-based language models for reasoning tasks.
devops
Enable LLM agents to improve continuously during deployment by constructing structured experience libraries through self-reflection on successes and failures—achieving 23% improvement on reasoning without gradient-based parameter updates or external training.