ajet/copilot/docker-install-agentjet-swarm-server/SKILL.md
Install and run the AgentJet Swarm Server in a Docker container with NVIDIA GPU support. Use when the user wants to deploy a swarm server on a GPU machine via Docker, including GPU driver setup, Docker mirror configuration, model weight mounting, and server startup.
If the user only needs to run the AgentJet client and does not need to run models locally (e.g. on a laptop), installing the AgentJet basic requirements (pip install -e .) is enough. See the install-agentjet-client skill.
This skill guides you through installing and running the AgentJet Swarm Server in a Docker container with GPU support.
Before proceeding, verify that the NVIDIA driver is installed and a GPU is visible:
nvidia-smi
If this fails, the system may not have NVIDIA drivers or GPU hardware.
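The same check can be scripted so that an installer stops early on a machine without a working driver. The following is a minimal sketch, not part of AgentJet: the helper name `gpu_driver_available` is hypothetical, and it only tests that `nvidia-smi` is on `PATH` and exits with status 0.

```python
import shutil
import subprocess


def gpu_driver_available(command: str = "nvidia-smi") -> bool:
    """Return True if `command` is on PATH and exits with status 0."""
    if shutil.which(command) is None:
        return False  # binary missing: driver tools are not installed
    try:
        result = subprocess.run(
            [command], capture_output=True, text=True, timeout=30, check=False
        )
    except (OSError, subprocess.TimeoutExpired):
        return False  # could not launch the tool, or it hung
    return result.returncode == 0


if __name__ == "__main__":
    print("GPU driver OK" if gpu_driver_available() else "nvidia-smi check failed")
```

A `False` result corresponds to the failure case described above: either the driver tools are absent or the tool could not talk to a GPU.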
sudo apt update
sudo apt install -y curl
# Install Docker with the convenience script, then enable and start the service
curl -fsSL https://get.docker.com | sh \
&& sudo systemctl --now enable docker
# Add NVIDIA repository
distribution=$(. /etc/os-release;echo $ID$VERSION_ID) \
&& curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
&& curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list | \
sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
# Install nvidia-docker2
sudo apt-get update
sudo apt-get install -y nvidia-docker2
# Restart Docker daemon
sudo systemctl restart docker
If pulling Docker images is too slow, configure a mirror registry:
# Create or edit Docker daemon config
sudo mkdir -p /etc/docker
sudo tee /etc/docker/daemon.json <<EOF
{
"registry-mirrors": [
"https://docker.1ms.run",
"https://docker.xuanyuan.me"
]
}
EOF
# Restart Docker
sudo systemctl daemon-reload
sudo systemctl restart docker
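Note that `tee` replaces the whole file, so any settings already present in `/etc/docker/daemon.json` would be lost. If the host already has a daemon config, merging the mirrors into it may be safer. The sketch below shows one possible merge; the function name `merge_registry_mirrors` and the sample existing config are assumptions for illustration only.

```python
import json


def merge_registry_mirrors(existing: dict, mirrors: list[str]) -> dict:
    """Return a copy of `existing` with `mirrors` appended to registry-mirrors.

    Other keys are kept untouched, and mirrors that are already listed
    are not duplicated.
    """
    merged = dict(existing)
    current = list(merged.get("registry-mirrors", []))
    for mirror in mirrors:
        if mirror not in current:
            current.append(mirror)
    merged["registry-mirrors"] = current
    return merged


if __name__ == "__main__":
    # Hypothetical existing config; on a real host this would be read from
    # /etc/docker/daemon.json.
    existing = {"log-driver": "json-file"}
    updated = merge_registry_mirrors(
        existing, ["https://docker.1ms.run", "https://docker.xuanyuan.me"]
    )
    print(json.dumps(updated, indent=2))
```

The printed JSON can then be written back to the daemon config with root privileges before restarting Docker.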
For ghcr.io images, use a mirror prefix:
# Original (may be slow)
docker pull ghcr.io/modelscope/agentjet:main
# Using mirror (faster in China)
docker pull ghcr.modelscope.cn/modelscope/agentjet:main
# Or use dockerhub mirror
docker pull docker.1ms.run/modelscope/agentjet:main
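The prefix swap in the commands above is a plain string rewrite of the registry host. As a rough illustration (the helper `with_mirror` is hypothetical and does not check that the mirror actually serves the image):

```python
def with_mirror(image: str, mirror_host: str, source_host: str = "ghcr.io") -> str:
    """Swap the registry host of `image` for `mirror_host`.

    Images that do not start with `source_host` are returned unchanged.
    """
    prefix = source_host + "/"
    if image.startswith(prefix):
        return f"{mirror_host}/{image[len(prefix):]}"
    return image


if __name__ == "__main__":
    print(with_mirror("ghcr.io/modelscope/agentjet:main", "ghcr.modelscope.cn"))
```

The repository path and tag stay the same; only the host in front of the first slash changes.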
| Mirror | Region | Note |
|--------|--------|------|
| docker.1ms.run | China | General Docker Hub mirror |
| docker.xuanyuan.me | China | Alternative mirror |
| ghcr.modelscope.cn | China | GitHub Container Registry mirror |
| registry.docker-cn.com | China | Official Docker China mirror |
Verify that the mirrors are registered:
docker info | grep -A 5 "Registry Mirrors"
Verify that containers can access the GPU:
docker run --rm --gpus=all nvidia/cuda:11.8.0-base-ubuntu22.04 nvidia-smi
Download LLM model weights locally (e.g., Qwen2.5-7B-Instruct):
# Example using modelscope
pip install modelscope
modelscope download --model Qwen/Qwen2.5-7B-Instruct --local_dir ./Qwen2.5-7B-Instruct
# Create directories for logs and experiments
mkdir -p ./swarmlog ./swarmexp
# Run AgentJet Swarm Server
docker run --rm -it \
-v /path/to/host/Qwen/Qwen2.5-7B-Instruct:/Qwen/Qwen2.5-7B-Instruct \
-v ./swarmlog:/workspace/log \
-v ./swarmexp:/workspace/saved_experiments \
-p 10086:10086 \
-e SWANLAB_API_KEY=$SWANLAB_API_KEY \
--gpus=all \
--shm-size=32GB \
ghcr.io/modelscope/agentjet:main \
bash -c "(ajet-swarm overwatch) & (NO_COLOR=1 LOGURU_COLORIZE=NO ajet-swarm start &>/workspace/log/swarm_server.log)"
| Flag | Purpose |
|------|---------|
| --rm | Auto-remove container on exit |
| -it | Interactive TTY for TUI monitor |
| -v <host>:<container> | Mount model weights, logs, and saved experiments into the container |
| -p 10086:10086 | Expose API port for Swarm Clients |
| --gpus=all | Use all available GPUs |
| --shm-size=32GB | Shared memory for large model inference |
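When the launch is driven from a script, it can help to assemble the same flags as an argument list so that paths and ports are parameters rather than hand-edited strings. This is only a sketch: `build_docker_run` and its parameter names are hypothetical, the defaults mirror the command above, and `-e SWANLAB_API_KEY` without a value relies on Docker passing the variable through from the host environment.

```python
import shlex


def build_docker_run(
    model_host_dir: str,
    model_container_dir: str,
    log_dir: str = "./swarmlog",
    exp_dir: str = "./swarmexp",
    port: int = 10086,
    shm_size: str = "32GB",
    image: str = "ghcr.io/modelscope/agentjet:main",
) -> list[str]:
    """Assemble the docker run argv for the swarm server launch shown above."""
    inner = (
        "(ajet-swarm overwatch) & "
        "(NO_COLOR=1 LOGURU_COLORIZE=NO ajet-swarm start "
        "&>/workspace/log/swarm_server.log)"
    )
    return [
        "docker", "run", "--rm", "-it",
        "-v", f"{model_host_dir}:{model_container_dir}",  # model weights
        "-v", f"{log_dir}:/workspace/log",                # server logs
        "-v", f"{exp_dir}:/workspace/saved_experiments",  # experiment outputs
        "-p", f"{port}:{port}",
        "-e", "SWANLAB_API_KEY",  # value is taken from the host environment
        "--gpus=all",
        f"--shm-size={shm_size}",
        image,
        "bash", "-c", inner,
    ]


if __name__ == "__main__":
    cmd = build_docker_run(
        "/path/to/host/Qwen/Qwen2.5-7B-Instruct", "/Qwen/Qwen2.5-7B-Instruct"
    )
    print(shlex.join(cmd))  # shell-quoted form, suitable for copy and paste
```

The resulting list can be passed to `subprocess.run(cmd)` directly, which avoids a second layer of shell quoting around the `bash -c` string.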
After launch, you should see the ajet-swarm overwatch TUI showing server state transitions:
OFFLINE -> BOOTING -> ROLLING -> WEIGHT_SYNCING -> ROLLING -> ...
The server enters BOOTING only after a Swarm Client sends a training configuration.
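For scripted monitoring, the listed sequence can be treated as a small transition table. The sketch below encodes only the transitions named above; the real server may define additional states or transitions, and the names `ALLOWED` and `is_expected_sequence` are hypothetical.

```python
# Transitions as listed in the text above; the real server may define more.
ALLOWED = {
    "OFFLINE": {"BOOTING"},
    "BOOTING": {"ROLLING"},
    "ROLLING": {"WEIGHT_SYNCING"},
    "WEIGHT_SYNCING": {"ROLLING"},
}


def is_expected_sequence(states: list[str]) -> bool:
    """Return True if each consecutive pair of states is a listed transition."""
    return all(
        nxt in ALLOWED.get(cur, set()) for cur, nxt in zip(states, states[1:])
    )


if __name__ == "__main__":
    observed = ["OFFLINE", "BOOTING", "ROLLING", "WEIGHT_SYNCING", "ROLLING"]
    print(is_expected_sequence(observed))
```

A sequence that jumps from OFFLINE straight to ROLLING would be flagged, which matches the note that BOOTING is entered first, once a client has sent a training configuration.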
From any machine that can reach the server:
from ajet.tuner_lib.experimental.swarm_client import SwarmClient
from ajet.copilot.job import AgentJetJob

swarm_worker = SwarmClient("http://<server-ip>:10086")
swarm_worker.auto_sync_train_config_and_start_engine(
    AgentJetJob(
        algorithm="grpo",
        n_gpu=8,
        model="/Qwen/Qwen2.5-7B-Instruct",  # Container-side path
        batch_size=32,
        num_repeat=4,
    )
)
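Because `model` is resolved inside the container, it has to fall under the container side of one of the `-v` mounts. A quick sanity check before starting a job might look like the sketch below; `model_path_is_mounted` is a hypothetical helper, not an AgentJet API, and it only compares path strings.

```python
from pathlib import PurePosixPath


def model_path_is_mounted(model: str, mounts: list[str]) -> bool:
    """Return True if `model` equals or lies under a mount's container path.

    Each mount uses the `-v` form "<host>:<container>" (an optional third
    field such as ":ro" is ignored).
    """
    target = PurePosixPath(model)
    for mount in mounts:
        parts = mount.split(":")
        if len(parts) < 2:
            continue  # not a host:container pair
        container = PurePosixPath(parts[1])
        if target == container or container in target.parents:
            return True
    return False


if __name__ == "__main__":
    mounts = ["/path/to/host/Qwen/Qwen2.5-7B-Instruct:/Qwen/Qwen2.5-7B-Instruct"]
    print(model_path_is_mounted("/Qwen/Qwen2.5-7B-Instruct", mounts))
```

Passing the host-side path as `model` would fail this check, which is the usual cause of a "model not found" error on the server.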
| Symptom | Cause | Fix |
|---------|--------|-----|
| Server stays OFFLINE | No client connected | Run Swarm Client script |
| Model not found | Wrong container path | Verify -v mount matches model field |
| Cannot connect port 10086 | Firewall | Check firewall rules |
| Empty log file | Missing log directory | mkdir -p ./swarmlog |
| Image pull timeout | Slow registry access | Configure Docker mirror (Step 4) |
| Image pull fails | Wrong mirror URL | Try different mirror or use original URL |
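For the port 10086 row, a plain TCP probe from the client machine can separate network problems from server-side ones. The following is a minimal sketch with a hypothetical helper name; it only tests that a TCP connection can be opened, not that the Swarm Server API is healthy.

```python
import socket


def port_is_reachable(host: str, port: int = 10086, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False  # refused, timed out, or host could not be resolved


if __name__ == "__main__":
    # 127.0.0.1 is a placeholder; use the server's address in practice.
    print(port_is_reachable("127.0.0.1"))
```

If the probe fails from a remote machine but succeeds on the server itself, a firewall or a missing `-p 10086:10086` mapping is the likely cause.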