skills/llm/actor-critic-game-agent/SKILL.md
Shared-backbone neural network with actor (policy) and critic (value) heads for grid-based game agent RL training
npx skillsauth add wenmin-wu/ds-skills llm-actor-critic-game-agent
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
For grid-based game agents (Halite, Lux AI, microRTS), an actor-critic architecture shares a feature backbone and splits into two heads: the actor outputs action probabilities, and the critic estimates the state value. Training computes the advantage (return minus value), then updates the actor with a policy-gradient loss weighted by that advantage and the critic with a Huber loss against the return. This is typically more sample-efficient than pure policy gradient, because the critic's baseline reduces gradient variance, and often more stable than pure value-based methods.
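The quantities named above can be written out as follows. The notation is an assumption introduced here for illustration and does not come from the skill itself: $r_t$ is the reward at step $t$, $G_t$ the discounted return, $V(s_t)$ the critic output, $\pi(a_t \mid s_t)$ the actor's probability for the action taken, and $\gamma$ the discount factor.

$$
G_t = \sum_{k \ge 0} \gamma^{k} r_{t+k}, \qquad
A_t = G_t - V(s_t)
$$

$$
L_{\text{actor}} = -\sum_t \log \pi(a_t \mid s_t)\, A_t, \qquad
L_{\text{critic}} = \sum_t \mathrm{Huber}\big(G_t - V(s_t)\big)
$$

A positive advantage raises the probability of the action taken, and a negative one lowers it. The Huber loss is quadratic for small errors and linear for large ones, which limits the influence of outlier returns on the critic.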
import tensorflow as tf
import numpy as np


def build_actor_critic(state_dim, num_actions):
    """Build a Keras model mapping a flat state to (action_probs, value)."""
    inputs = tf.keras.Input(shape=(state_dim,))

    # Actor branch: softmax over the discrete actions.
    x = tf.keras.layers.Dense(128, activation='tanh')(inputs)
    x = tf.keras.layers.Dense(32, activation='tanh')(x)
    actor = tf.keras.layers.Dense(num_actions, activation='softmax')(x)

    # Critic branch: its own hidden layers on the same input, one scalar value.
    cx = tf.keras.layers.Dense(128, activation='relu')(inputs)
    cx = tf.keras.layers.Dense(32, activation='relu')(cx)
    critic = tf.keras.layers.Dense(1)(cx)

    return tf.keras.Model(inputs=inputs, outputs=[actor, critic])


def compute_returns(rewards, gamma=0.99):
    """Discounted returns, normalized to zero mean and unit variance."""
    returns = []
    discounted = 0
    for r in reversed(rewards):  # accumulate from the last step backward
        discounted = r + gamma * discounted
        returns.insert(0, discounted)
    returns = np.array(returns)
    return (returns - returns.mean()) / (returns.std() + 1e-8)


# Example: 441 state features, 5 discrete actions.
model = build_actor_critic(state_dim=441, num_actions=5)
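To make the return computation concrete, here is a small worked example using NumPy only. It is a sketch that mirrors `compute_returns` above; the `normalize` flag is a hypothetical addition so the raw discounted values can be inspected, and the reward sequence is made up for illustration.

```python
import numpy as np


def compute_returns(rewards, gamma=0.99, normalize=True):
    """Discounted returns, filled in from the last step backward.

    `normalize` is a hypothetical flag added for this example only.
    """
    returns = np.zeros(len(rewards), dtype=np.float64)
    discounted = 0.0
    for t in reversed(range(len(rewards))):
        discounted = rewards[t] + gamma * discounted
        returns[t] = discounted
    if normalize:
        returns = (returns - returns.mean()) / (returns.std() + 1e-8)
    return returns


# A single reward of 1 at the final step, with gamma = 0.5.
raw = compute_returns([0.0, 0.0, 1.0], gamma=0.5, normalize=False)
# raw returns are 0.25, 0.5, 1.0: each earlier step sees the reward halved.

norm = compute_returns([0.0, 0.0, 1.0], gamma=0.5)
# After normalization the values have mean ~0 and standard deviation ~1,
# while their ordering is preserved.
```

Writing into a preallocated array avoids the repeated `list.insert(0, ...)` calls, which matters only for long episodes; the result is the same.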
Loss: actor with -log_prob * advantage, critic with Huber loss
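The two loss terms can be checked numerically with a minimal NumPy sketch. This is an illustration under stated assumptions, not the skill's training loop: the helper names `huber` and `actor_critic_losses`, the `delta=1.0` threshold, the summation over steps, and the sample numbers are all hypothetical.

```python
import numpy as np


def huber(error, delta=1.0):
    """Elementwise Huber loss: quadratic for |error| <= delta, linear beyond."""
    abs_err = np.abs(error)
    quadratic = 0.5 * error ** 2
    linear = delta * (abs_err - 0.5 * delta)
    return np.where(abs_err <= delta, quadratic, linear)


def actor_critic_losses(action_probs, actions, values, returns):
    """Return (actor_loss, critic_loss), each summed over the episode.

    action_probs: (T, num_actions) actor outputs
    actions:      (T,) indices of the actions actually taken
    values:       (T,) critic outputs
    returns:      (T,) discounted returns
    """
    steps = np.arange(len(actions))
    # Log-probability of the action taken at each step (epsilon avoids log(0)).
    log_probs = np.log(action_probs[steps, actions] + 1e-8)
    advantages = returns - values
    actor_loss = np.sum(-log_probs * advantages)
    critic_loss = np.sum(huber(returns - values))
    return actor_loss, critic_loss


# Two steps, two actions; the numbers are made up for illustration.
probs = np.array([[0.5, 0.5], [0.25, 0.75]])
actor_loss, critic_loss = actor_critic_losses(
    probs,
    actions=np.array([0, 1]),
    values=np.array([0.0, 0.5]),
    returns=np.array([1.0, 2.5]),
)
```

In a TensorFlow training step the same quantities would be computed on the model's outputs inside a `tf.GradientTape`, for example with `tf.keras.losses.Huber` for the critic term, so that gradients can be applied to the network weights.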