skills/can-implement-agent-based-odd-based/SKILL.md
Translate ODD protocol specifications into validated, executable agent-based model (ABM) code in Python. Use when the user says 'implement this ABM', 'convert ODD to code', 'build an agent-based model from this specification', 'replicate this NetLogo model in Python', 'translate this model description into a simulation', or 'create a predator-prey / ecological / social ABM'.
npx skillsauth add ndpvt-web/arxiv-claude-skills can-implement-agent-based-odd-based
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
This skill enables Claude to translate agent-based model (ABM) descriptions written in the ODD (Overview, Design concepts, Details) protocol into correct, validated Python implementations. Based on a rigorous replication study (Fachada et al., 2026) that tested 17 LLMs on ODD-to-code translation, this skill encodes the patterns that produce statistically faithful implementations and avoids the failure modes that cause behaviorally incorrect simulations -- even when the code runs without errors.
The ODD protocol (Grimm et al., 2006, 2010) is a standardized framework for describing agent-based models. It has three blocks: Overview (purpose, entities, state variables, process overview, scheduling), Design concepts (emergence, adaptation, sensing, interaction, stochasticity, observation), and Details (initialization, input data, submodels). The critical insight from the paper is that executability is insufficient for scientific validity -- code that runs and produces output can still be behaviorally wrong in subtle ways that only surface under statistical comparison.
The validated approach uses a staged evaluation pipeline with six levels: (1) code presence, (2) syntax correctness, (3) runtime success, (4) output format compliance, (5) statistical comparison against a reference baseline, and (6) distributional equivalence under multiple parameter regimes. Stages 1-4 are necessary but not sufficient; only implementations passing stage 6 are scientifically usable. The statistical comparison uses PCA-based dimensionality reduction on time-series outputs followed by Energy tests (nonparametric multivariate distributional comparison), with Benjamini-Hochberg correction for multiple testing.
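The Benjamini-Hochberg step mentioned above can be illustrated with a few lines of numpy. This is a minimal sketch of the standard procedure, not the paper's own code; the function name `benjamini_hochberg` and the example p-values are made up for illustration.

```python
import numpy as np

def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)
    # Scale each sorted p-value by m / rank
    ranked = p[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity, working down from the largest rank
    ranked = np.minimum.accumulate(ranked[::-1])[::-1]
    adjusted = np.empty(m)
    adjusted[order] = np.clip(ranked, 0.0, 1.0)
    return adjusted

raw = [0.001, 0.008, 0.039, 0.041, 0.20]  # illustrative p-values only
adj = benjamini_hochberg(raw)
differs = adj < 0.01  # True where a comparison rejects equality at alpha=0.01
```

With one p-value per output metric or parameter regime, the adjusted values are what get compared to alpha.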
The paper identifies specific implementation patterns that separate success from failure: correct asynchronous agent scheduling (random order within each tick), proper energy accounting across all operations, countdown-based state management for environmental processes, toroidal boundary handling, and post-iteration output collection. Models that synchronize agent actions, misapply reproduction probabilities, or skip environmental regeneration cycles produce code that appears correct but diverges statistically from the reference under load.
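Two of the patterns named above, random agent order within each tick and toroidal boundary handling, might look like the following. This is a sketch under assumptions: the helper name `move_agents` is hypothetical, agents are `[x, y, energy]` lists, and movement is restricted to the four von Neumann neighbours (whether an agent may also stay in place depends on the ODD).

```python
import random

# Von Neumann offsets; assumed here, check the ODD's movement submodel.
OFFSETS = [(-1, 0), (1, 0), (0, -1), (0, 1)]

def move_agents(agents, xdim, ydim, rng):
    """Move each agent one cell, visiting agents in a fresh random order."""
    order = list(range(len(agents)))
    rng.shuffle(order)  # asynchronous scheduling: new order every tick
    for i in order:
        dx, dy = rng.choice(OFFSETS)
        agents[i][0] = (agents[i][0] + dx) % xdim  # toroidal wrap in x
        agents[i][1] = (agents[i][1] + dy) % ydim  # toroidal wrap in y
    return order

rng = random.Random(42)
agents = [[0, 0, 5], [9, 9, 3]]  # [x, y, energy]
order = move_agents(agents, 10, 10, rng)
```

Python's `%` returns a non-negative result for a positive modulus, so stepping left from column 0 lands on column `xdim - 1` without a special case.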
Parse the ODD specification into structured sections. Extract: purpose, entities and their state variables, spatial structure (grid dimensions, topology), process overview with explicit scheduling order, design concepts (especially stochasticity and interaction rules), initialization parameters, and submodel equations. If the user provides prose rather than formal ODD, restructure it into ODD sections before proceeding.
Define the function signature with explicit parameters. Create a single entry-point function (e.g., run_model()) that accepts all model parameters as typed arguments. Use only standard scientific Python (numpy, pandas). List every parameter from the ODD's initialization and submodel sections -- do not hardcode values.
Implement entity classes or data structures for each agent type and environmental cell. Each entity needs: all state variables from the ODD, methods for each submodel action (move, eat, reproduce, die), and energy tracking. Use numpy arrays for grid state if performance matters; use simple classes if clarity matters.
Implement the scheduling loop in exact ODD order. This is the most failure-prone step. Follow the process overview literally: if the ODD says "agents act in random order," shuffle the agent list each tick. If it says "movement, then feeding, then reproduction," execute those phases sequentially within each tick. Do not batch or parallelize phases unless the ODD explicitly allows it.
Implement each submodel as a separate function matching the ODD's Details section. For movement: respect the specified neighborhood (von Neumann vs. Moore) and topology (toroidal wrapping). For feeding: apply energy gains/losses exactly. For reproduction: check energy thresholds before applying probability, then split energy between parent and offspring. For death: trigger when energy reaches zero, removing the agent from the active list.
Implement environmental dynamics as cell-level state machines. Food regeneration, resource depletion, or other grid processes need countdown timers or state flags per cell. Process these at the correct point in the scheduling loop, not as agent actions.
Collect output metrics at the correct phase of each iteration. Record population counts, mean energy values, and environmental state after all agent actions for that tick are complete. Return results as a pandas DataFrame with one row per iteration and named columns matching the ODD's observation section.
Run a smoke test with minimal parameters (3-5 iterations) to verify executability and output format. Check: function returns a DataFrame, column names match specification, no runtime errors, values are in plausible ranges.
Run full validation: execute 30 independent replications under at least two parameter regimes. Compare output distributions against reference data using PCA + Energy test (or Mann-Whitney U per metric if reference distributions are available). Statistical equivalence at alpha=0.01 with Benjamini-Hochberg correction is the target.
Measure code quality: compute SLOC, cyclomatic complexity, and lint/type-check density. Target: maintainability index above 50, type error density below 5 per 100 SLOC, minimal ruff/pylint warnings. Refactor only if quality metrics are poor -- do not over-engineer a passing implementation.
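The smoke-test step in the workflow above can be wrapped in a small checker. The sketch below is illustrative only: `smoke_test` and `toy_model` are hypothetical names, the entry point is assumed to accept an `n_iters` keyword, and the row-count check assumes one row per iteration with no separate initial-state row.

```python
import pandas as pd

def smoke_test(run_model, expected_columns, n_iters=5, **params):
    """Return a list of problems found in a short run (empty list = pass)."""
    df = run_model(n_iters=n_iters, **params)
    if not isinstance(df, pd.DataFrame):
        return ["run_model did not return a DataFrame"]
    problems = []
    if list(df.columns) != list(expected_columns):
        problems.append(f"unexpected columns: {list(df.columns)}")
    if len(df) != n_iters:  # assumes one row per iteration
        problems.append(f"expected {n_iters} rows, got {len(df)}")
    numeric = df.select_dtypes(include="number")
    if (numeric < 0).any().any():  # counts and energies should not be negative
        problems.append("negative values in numeric columns")
    return problems

def toy_model(n_iters, start=10):
    """Stand-in model used only to exercise the check."""
    return pd.DataFrame({
        "total_prey": [start + t for t in range(n_iters)],
        "total_predators": [start] * n_iters,
    })

problems = smoke_test(toy_model, ["total_prey", "total_predators"])
```

An empty list means the short run passed these format checks; it says nothing yet about statistical agreement with a reference.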
Example 1: PPHPC Predator-Prey Model
User: "Implement the PPHPC predator-prey model. Grid is 100x100 toroidal. Prey eat food on cells, predators eat prey. Both reproduce probabilistically above energy thresholds. Food regenerates after a countdown."
Approach:
run_pphpc(xdim, ydim, n_prey, n_pred, n_iters, prey_gain, pred_gain, prey_loss, pred_loss, prey_repro_thresh, pred_repro_thresh, prey_repro_prob, pred_repro_prob, food_restart) returning a DataFrame
Output:
```python
import numpy as np
import pandas as pd


def run_pphpc(xdim=100, ydim=100, n_prey=400, n_pred=200, n_iters=4000,
              prey_gain=4, pred_gain=20, prey_loss=1, pred_loss=1,
              prey_repro_thresh=4, pred_repro_thresh=20,
              prey_repro_prob=0.25, pred_repro_prob=0.05,
              food_restart=10):
    # Grid: food_present (bool), restart_counter (int)
    food = np.ones((xdim, ydim), dtype=bool)
    counters = np.zeros((xdim, ydim), dtype=int)
    # Agent lists: [x, y, energy]
    # Note: np.random.randint excludes its upper bound, so initial energy is
    # drawn from 1 .. 2 * gain - 1; widen the range if the ODD's is inclusive.
    prey = [[np.random.randint(xdim), np.random.randint(ydim),
             np.random.randint(1, 2 * prey_gain)] for _ in range(n_prey)]
    predators = [[np.random.randint(xdim), np.random.randint(ydim),
                  np.random.randint(1, 2 * pred_gain)] for _ in range(n_pred)]
    records = []
    offsets = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # von Neumann
    for t in range(n_iters):
        # Movement (all agents, random order)
        all_agents = [(a, 'prey') for a in prey] + [(a, 'pred') for a in predators]
        np.random.shuffle(all_agents)
        # ... (movement logic with toroidal wrap)
        # Food regeneration
        regen_mask = (~food) & (counters <= 0)
        # ... (countdown decrement, food restore)
        # Prey feeding, predator hunting, reproduction, death removal
        # ... (full submodel implementations)
        # Collect output after all agent actions for this tick
        records.append({
            'total_prey': len(prey), 'total_predators': len(predators),
            'total_food': int(food.sum()),
            'mean_energy_prey': np.mean([a[2] for a in prey]) if prey else 0,
            'mean_energy_predators': np.mean([a[2] for a in predators]) if predators else 0,
            'mean_c': counters[~food].mean() if (~food).any() else 0,
        })
    return pd.DataFrame(records)
```
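The run_pphpc example elides the food countdown and the reproduction submodel. One way those steps could look is sketched below; the helper names (`eat_food`, `regrow_food`, `try_reproduce`) are hypothetical, and the details (strict `>` threshold comparison, integer energy split, counter reset to zero on regrowth) are assumptions to be checked against the ODD rather than the reference behaviour.

```python
import numpy as np

def eat_food(food, counters, x, y, food_restart):
    """Consume food at (x, y) if present and start that cell's countdown."""
    if food[x, y]:
        food[x, y] = False
        counters[x, y] = food_restart
        return True
    return False

def regrow_food(food, counters):
    """Decrement countdowns on empty cells; restore food where they hit zero."""
    empty = ~food
    counters[empty] -= 1
    regrown = empty & (counters <= 0)
    food[regrown] = True
    counters[regrown] = 0
    return int(regrown.sum())

def try_reproduce(agent, repro_thresh, repro_prob, rng):
    """agent is [x, y, energy]; return an offspring list or None."""
    # Threshold is checked before the probability draw (assumed strict >)
    if agent[2] > repro_thresh and rng.random() < repro_prob:
        child_energy = agent[2] // 2
        agent[2] -= child_energy  # parent keeps the remainder, energy conserved
        return [agent[0], agent[1], child_energy]
    return None

rng = np.random.default_rng(0)
food = np.ones((2, 2), dtype=bool)
counters = np.zeros((2, 2), dtype=int)
eat_food(food, counters, 0, 0, food_restart=2)
```

Keeping the countdown on the grid rather than on the agents means regeneration runs once per tick at a fixed point in the schedule, independent of how many agents visit a cell.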
Example 2: Schelling Segregation Model from ODD Description
User: "Here is the ODD for a Schelling segregation model. Grid 50x50, two agent types (40% each, 20% empty). Agents are unhappy if fewer than 30% of neighbors are same type. Unhappy agents move to random empty cell. Run until stable or 500 ticks."
Approach:
run_schelling(dim=50, frac_a=0.4, frac_b=0.4, tolerance=0.3, max_iters=500)
Output:
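```python
import numpy as np
import pandas as pd


def run_schelling(dim=50, frac_a=0.4, frac_b=0.4, tolerance=0.3, max_iters=500):
    grid = np.zeros((dim, dim), dtype=int)  # 0=empty, 1=type_a, 2=type_b
    # Initialize populations (assumed here: uniform random placement)
    cells = np.random.permutation(dim * dim)
    n_a, n_b = int(frac_a * dim * dim), int(frac_b * dim * dim)
    grid.flat[cells[:n_a]] = 1
    grid.flat[cells[n_a:n_a + n_b]] = 2
    records = []
    for t in range(max_iters):
        # compute_happiness is a helper to be supplied: it returns a boolean
        # array, True where the agent in a cell meets the tolerance
        happiness = compute_happiness(grid, tolerance)  # all at once
        occupied = grid > 0
        # Measure happiness before any moves so the mask matches the grid
        frac_happy = happiness[occupied].mean()
        unhappy = list(zip(*np.where(~happiness & occupied)))
        np.random.shuffle(unhappy)
        empty = list(zip(*np.where(grid == 0)))
        moves = 0
        for (r, c) in unhappy:
            if empty:
                nr, nc = empty.pop(np.random.randint(len(empty)))
                grid[nr, nc] = grid[r, c]
                grid[r, c] = 0
                empty.append((r, c))
                moves += 1
        records.append({'tick': t, 'frac_happy': frac_happy, 'moves': moves})
        if moves == 0:
            break
    return pd.DataFrame(records)
```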
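The compute_happiness helper that run_schelling calls is not defined in the example. One possible version is sketched here, assuming a Moore (8-cell) neighbourhood on a non-wrapping grid and treating agents with no occupied neighbours as happy; the user's description fixes none of these choices, so adjust them to the actual ODD.

```python
import numpy as np

def compute_happiness(grid, tolerance):
    """Boolean array: True where the same-type neighbour share >= tolerance.

    Assumptions: Moore neighbourhood, no wrap-around, 0 marks an empty cell,
    and an agent with no occupied neighbours counts as happy.
    """
    rows, cols = grid.shape
    padded = np.pad(grid, 1)  # zero padding acts as an empty border
    same = np.zeros(grid.shape)
    occupied = np.zeros(grid.shape)
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            if dr == 0 and dc == 0:
                continue
            # nb[r, c] is the value of the neighbour at (r + dr, c + dc)
            nb = padded[1 + dr:1 + dr + rows, 1 + dc:1 + dc + cols]
            same += (nb == grid) & (grid > 0)
            occupied += nb > 0
    frac = np.divide(same, occupied, out=np.ones(grid.shape), where=occupied > 0)
    return frac >= tolerance

demo = np.array([[1, 1, 0],
                 [1, 2, 0],
                 [0, 0, 0]])
happy = compute_happiness(demo, tolerance=0.3)
```

Values at empty cells are not meaningful, so callers should mask the result with `grid > 0` before summarising.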
Example 3: Validating a Generated ABM Against Reference Data
User: "I have 30 CSV files from a NetLogo reference run and 30 from my Python implementation. Are they statistically equivalent?"
Approach:
Output:
```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from scipy.spatial.distance import cdist


def validate_abm(ref_runs, gen_runs, alpha=0.01):
    """ref_runs, gen_runs: lists of DataFrames with identical columns."""
    # Stack and standardize
    all_data = pd.concat(ref_runs + gen_runs)
    scaler = StandardScaler().fit(all_data)
    ref_scaled = [scaler.transform(r) for r in ref_runs]
    gen_scaled = [scaler.transform(g) for g in gen_runs]
    # PCA: keep enough components to explain 80% of the variance
    pca = PCA(n_components=0.8).fit(np.vstack(ref_scaled))
    ref_pca = [pca.transform(r) for r in ref_scaled]
    gen_pca = [pca.transform(g) for g in gen_scaled]
    # Energy test per time step or aggregated
    # ... (compute energy statistic and permutation p-value)
    # Benjamini-Hochberg correction
    # ... (adjust p-values, compare to alpha)
    # results_df is assembled in the elided steps above
    return results_df  # columns: metric, p_value, adjusted_p, equivalent
```
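The elided energy-test step could be filled with a generic permutation test on the energy distance. The sketch below is an assumption about how that step might be done, not the procedure used in the paper; `energy_statistic` and `energy_permutation_test` are hypothetical names, and each input row is taken to be one replication's PCA scores.

```python
import numpy as np
from scipy.spatial.distance import cdist

def energy_statistic(x, y):
    """Energy distance between samples x (n, d) and y (m, d)."""
    return 2.0 * cdist(x, y).mean() - cdist(x, x).mean() - cdist(y, y).mean()

def energy_permutation_test(x, y, n_perm=999, seed=0):
    """Return (observed statistic, permutation p-value)."""
    rng = np.random.default_rng(seed)
    observed = energy_statistic(x, y)
    pooled = np.vstack([x, y])
    n = len(x)
    exceed = 0
    for _ in range(n_perm):
        # Reassign group labels at random and recompute the statistic
        perm = rng.permutation(len(pooled))
        stat = energy_statistic(pooled[perm[:n]], pooled[perm[n:]])
        exceed += int(stat >= observed)
    # Add-one correction keeps the p-value strictly positive
    return observed, (exceed + 1) / (n_perm + 1)

rng = np.random.default_rng(1)
a = rng.normal(size=(30, 2))            # stand-in for reference PCA scores
b = rng.normal(loc=5.0, size=(30, 2))   # clearly shifted sample
stat, p = energy_permutation_test(a, b, n_perm=199)
```

A small p-value indicates the two sets of runs differ in distribution; failing to reject at alpha is what the skill treats as a passing comparison.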
Check if len(agents) > 0 before computing means; record 0 or NaN for extinct populations.
Fachada, N., Fernandes, D., Fernandes, C. M., & Matos-Carvalho, J. P. (2026). Can Large Language Models Implement Agent-Based Models? An ODD-based Replication Study. arXiv:2602.10140v1. Focus on: the six-stage evaluation pipeline (Table 2), the PPHPC ODD specification (Section 3), common failure modes by stage (Section 5), and the PCA + Energy test validation methodology (Section 4.3).