skills/skillxiv-v0.0.2-claude-opus-4.6/config-agents/SKILL.md
Learn optimal configurations for agentic AI systems through hierarchical RL that treats configuration as a query-wise decision problem. Structure policy selects workflows/tools/budgets while prompt policy composes specific instructions, achieving 25% accuracy improvement with 35% cost reduction.
npx skillsauth add ADu2021/skillXiv config-agentsInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Agentic AI systems have enormous configuration spaces: tool selection, workflow choice, computational budget, prompt templates, reasoning strategies, etc. Over 8,600 possible structural configurations exist in simple systems. Manual tuning is infeasible. Static heuristics fail across different query types. Different queries benefit from different configurations (some need web search, others need local reasoning; some need 1-step, others need 5-step reasoning). The challenge: learn to configure agents dynamically per query while balancing accuracy and cost.
ARC (Agentic Resource & Configuration learner) treats configuration as a query-wise decision problem solved through hierarchical RL. A two-level policy system:
Structure Policy: High-level decisions
Prompt Policy: Low-level decisions
Rather than exhaustive grid search, masked RL with SFT on successful trajectories enables efficient learning under sparse rewards.
Hierarchical policy architecture:
class HierarchicalConfigurationPolicy(nn.Module):
"""
Two-level policy: structure policy + prompt policy.
Structure decides what capabilities to use.
Prompt decides how to ask for them.
"""
def __init__(self, model, tool_library, workflow_library):
super().__init__()
self.model = model
self.tools = tool_library
self.workflows = workflow_library
# Structure policy: high-level configuration
self.structure_policy = nn.Sequential(
nn.Linear(768, 512),
nn.ReLU(),
nn.Linear(512, 256)
)
# Workflow head: which workflow type
self.workflow_head = nn.Linear(256, len(workflow_library))
# Tool selection head: which tools to enable
self.tool_head = nn.Linear(256, len(tool_library))
# Budget head: token/step budget
self.budget_head = nn.Linear(256, 10) # 10 budget levels
# Prompt policy: low-level instruction generation
self.prompt_policy = nn.TransformerDecoder(
nn.TransformerDecoderLayer(768, 8),
num_layers=2
)
def structure_decision(self, query):
"""
High-level structure decisions given query.
Returns: workflow, tools, budget.
"""
# Encode query
query_features = self.model.encode(query)
# Structure policy forward pass
structure_hidden = self.structure_policy(query_features)
# Decode structure choices
workflow_logits = self.workflow_head(structure_hidden)
tool_logits = self.tool_head(structure_hidden)
budget_logits = self.budget_head(structure_hidden)
# Sample/select decisions
workflow = torch.softmax(workflow_logits, dim=-1)
tools = torch.sigmoid(tool_logits) # Multi-binary selection
budget = torch.softmax(budget_logits, dim=-1)
return {
'workflow': workflow,
'tools': tools,
'budget': budget,
'logits': {
'workflow': workflow_logits,
'tools': tool_logits,
'budget': budget_logits
}
}
def action_masking(self, structure_decision):
"""
Mask invalid action combinations.
E.g., can't use web_search without internet access budget.
"""
workflow = structure_decision['workflow'].argmax()
tools = structure_decision['tools']
budget = structure_decision['budget'].argmax()
# Define constraints
constraints = {
'web_search': ['sequential', 'hierarchical'], # Need internet
'code_execution': ['dedicated_budget'], # Needs computation budget
'long_reasoning': ['hierarchical'] # Better for CoT
}
# Create mask
mask = torch.ones_like(tools)
for tool_idx, tool_name in enumerate(self.tools.keys()):
compatible_workflows = constraints.get(tool_name, [])
# Check compatibility
if compatible_workflows:
workflow_name = self.workflows[workflow]
if workflow_name not in compatible_workflows:
mask[tool_idx] = 0
return mask
def prompt_decision(self, query, structure_decision):
"""
Low-level prompt composition given structure.
Generates specific instructions for the agent.
"""
# Get structure information
workflow = structure_decision['workflow'].argmax()
selected_tools = structure_decision['tools'].nonzero(as_tuple=True)[0]
budget_level = structure_decision['budget'].argmax()
# Build prompt based on decisions
prompt_parts = []
# Add workflow instruction
workflow_name = list(self.workflows.keys())[workflow]
if workflow_name == 'sequential':
prompt_parts.append(
"Solve step-by-step using one tool at a time.")
elif workflow_name == 'hierarchical':
prompt_parts.append(
"Break into subproblems, solve each independently.")
elif workflow_name == 'parallel':
prompt_parts.append(
"Work on multiple approaches simultaneously.")
# Add tool instructions
for tool_idx in selected_tools:
tool_name = list(self.tools.keys())[tool_idx]
tool_instruction = self.tools[tool_name]['instruction']
prompt_parts.append(f"You can use: {tool_instruction}")
# Add budget constraints
budget_values = [100, 250, 500, 1000, 2000, 5000, 10000,
20000, 50000, 100000]
max_tokens = budget_values[budget_level]
prompt_parts.append(
f"Keep total reasoning under {max_tokens} tokens.")
# Add query
prompt_parts.append(f"Query: {query}")
composed_prompt = "\n".join(prompt_parts)
return {
'composed_prompt': composed_prompt,
'workflow': workflow_name,
'tools': [list(self.tools.keys())[i] for i in selected_tools],
'budget': budget_values[budget_level]
}
def forward(self, query):
"""
Full configuration pipeline.
"""
structure = self.structure_decision(query)
mask = self.action_masking(structure)
prompt = self.prompt_decision(query, structure)
return {
'structure': structure,
'mask': mask,
'prompt': prompt
}
Masked RL training:
class MaskedRLTrainer:
"""
Train configuration policy with action masking.
Prevents invalid configurations during exploration.
"""
def __init__(self, policy, environment):
self.policy = policy
self.env = environment
self.optimizer = torch.optim.Adam(
policy.parameters(), lr=1e-4)
def compute_masked_logits(self, logits, mask):
"""
Mask invalid actions before softmax.
"""
# Set logits of invalid actions to very negative
masked_logits = logits.clone()
masked_logits[mask == 0] = -1e10
return masked_logits
def sample_masked_action(self, logits, mask):
"""
Sample action respecting mask.
"""
masked_logits = self.compute_masked_logits(logits, mask)
probs = torch.softmax(masked_logits, dim=-1)
# Sample valid action only
valid_actions = torch.where(mask == 1)[0]
sampled_idx = torch.multinomial(
probs[valid_actions], num_samples=1)[0]
return valid_actions[sampled_idx]
def train_step(self, query_batch, num_epochs=1):
"""
PPO training step with masked actions.
"""
for epoch in range(num_epochs):
for query in query_batch:
# Get current policy output
config = self.policy(query)
structure = config['structure']
mask = config['mask']
# Sample configuration
sampled_workflow = self.sample_masked_action(
structure['logits']['workflow'],
torch.ones(len(self.policy.workflows)))
sampled_tools_logits = structure['logits']['tools']
sampled_tools = self.sample_masked_action(
sampled_tools_logits, mask)
# Execute configuration
trajectory = self.env.execute_agent_with_config(
query, config)
# Compute reward
reward = self.env.compute_reward(trajectory)
# Compute log probabilities
log_prob = self.compute_log_probability(
structure, sampled_workflow, sampled_tools)
# Policy gradient update
loss = -reward * log_prob
# Backward pass
self.optimizer.zero_grad()
loss.backward()
self.optimizer.step()
yield {
'query': query,
'reward': reward.item(),
'loss': loss.item()
}
def compute_log_probability(self, structure_output,
sampled_workflow, sampled_tools):
"""Compute log probability of sampled action."""
workflow_logits = structure_output['logits']['workflow']
tools_logits = structure_output['logits']['tools']
# Log probability of workflow choice
workflow_log_prob = torch.nn.functional.log_softmax(
workflow_logits, dim=-1)[sampled_workflow]
# Log probability of tool selection
tools_log_prob = torch.nn.functional.log_softmax(
tools_logits, dim=-1)[sampled_tools]
return workflow_log_prob + tools_log_prob
Supervised fine-tuning on successful trajectories:
class ConfigurationSupervisedFinetuning:
"""
Fine-tune policy on successful configuration examples.
Bootstraps RL with high-quality trajectories.
"""
def __init__(self, policy, successful_trajectories):
self.policy = policy
self.successful_trajs = successful_trajectories
self.optimizer = torch.optim.Adam(
policy.parameters(), lr=5e-5)
def collect_successful_examples(self):
"""
Gather examples where configuration led to success.
"""
examples = []
for traj in self.successful_trajs:
query = traj['query']
config = traj['configuration']
reward = traj['reward']
if reward > success_threshold := 0.8:
examples.append((query, config))
return examples
def supervised_update(self, batch_size=32, num_epochs=5):
"""
Train policy to reproduce successful configurations.
"""
examples = self.collect_successful_examples()
for epoch in range(num_epochs):
for batch_start in range(0, len(examples), batch_size):
batch = examples[batch_start:
batch_start + batch_size]
batch_loss = 0.0
for query, target_config in batch:
# Get policy output
policy_output = self.policy(query)
# Compute supervised loss
workflow_loss = self.compute_config_loss(
policy_output['structure']['logits']['workflow'],
target_config['workflow'])
tools_loss = self.compute_config_loss(
policy_output['structure']['logits']['tools'],
target_config['tools'])
loss = workflow_loss + tools_loss
batch_loss += loss.item()
# Backward pass
self.optimizer.zero_grad()
loss.backward()
self.optimizer.step()
print(f"Epoch {epoch}: Average Loss={
batch_loss / len(batch):.4f}")
def compute_config_loss(self, policy_logits, target_config):
"""Cross-entropy loss between policy and target."""
# Convert target to probability distribution
target_probs = torch.zeros_like(policy_logits)
target_probs[target_config] = 1.0
# KL divergence
loss = torch.nn.functional.kl_div(
torch.nn.functional.log_softmax(policy_logits, dim=-1),
target_probs,
reduction='batchmean')
return loss
When to use:
System design:
Define configuration space:
Implement environment:
Create action masks:
Training pipeline:
Configuration space size:
Expected improvements:
Training schedule:
Phase 1 (SFT): 1,000 successful trajectory examples
Phase 2 (Masked RL): 5,000-10,000 exploration steps
Phase 3 (Refinement): 1,000-2,000 steps
Evaluation:
Common configurations to learn:
Simple factual queries:
- Workflow: sequential
- Tools: web_search only
- Budget: low (500-1000 tokens)
Complex reasoning queries:
- Workflow: hierarchical
- Tools: web_search + code_execution
- Budget: high (10000+ tokens)
Coding queries:
- Workflow: sequential
- Tools: code_executor + debugger
- Budget: medium-high (5000+ tokens)
Planning queries:
- Workflow: hierarchical
- Tools: reasoning_engine
- Budget: medium (2000-5000 tokens)
Hierarchical RL with action masking enables learning optimal dynamic configurations for agentic AI systems. By decomposing configuration into structure (workflow/tools/budget) and prompt (instructions) decisions and constraining the action space through masks, systems can adapt per-query while exploring efficiently in a high-dimensional configuration space.
testing
Uses flow maps as look-ahead operators to enable principled reward-guided diffusion by predicting trajectory endpoints at any denoising step. Deploy when applying rewards or preferences to diffusion trajectories with meaningful gradients throughout generation.
testing
Train language models where each expert learns independently on closed datasets, enabling flexible inference with selective data inclusion or exclusion. 41% performance improvement while allowing users to opt out of specific data sources without retraining.
data-ai
Understand how token generation flexibility in diffusion LMs paradoxically constrains reasoning, as models exploit ordering flexibility to avoid uncertain tokens, and apply simplified approaches that preserve parallel decoding benefits. Use when optimizing diffusion-based language models for reasoning tasks.
devops
Enable LLM agents to improve continuously during deployment by constructing structured experience libraries through self-reflection on successes and failures—achieving 23% improvement on reasoning without gradient-based parameter updates or external training.