Container Orchestration & Dev Stacks

The Pattern

Problem: You're executing tasks in containers (one per task). Those tasks can fork-bomb, exhaust memory, run forever, or leave orphan containers after a crash. You need safety limits, monitoring, and cleanup — plus optional sidecar services (databases, caches, auxiliary APIs).

Approach: Hard container limits (PID, memory, CPU, lifetime), a watchdog loop that polls docker stats and kills violators, orphan recovery on restart, and sidecar provisioning with bind-mounted persistent data.

Pattern proven in production across multiple Python CLI tools and web services.

Key Design Decisions

1. Container safety limits — the runaway processes incident

Safety limits exist because of a real incident: in one production deployment, over 4,000 runaway test processes consumed 103Gi of RAM and caused OOM kills across the host.

# Container safety limits — prevent fork bomb and memory exhaustion incidents.
# These values were determined after a real incident where thousands of runaway
# processes consumed all available RAM and caused OOM kills.
CONTAINER_PIDS_LIMIT = 256
CONTAINER_MEMORY_LIMIT = "8g"
CONTAINER_MEMORY_SWAP_LIMIT = "8g"
CONTAINER_CPU_LIMIT = 2.0
MAX_INSTANCE_LIFETIME_SECONDS = 12 * 60 * 60  # 12 hours

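These are passed to docker create as resource constraints. The PID limit is the most critical: it caps how many processes the container's cgroup may hold, so a fork bomb fails at the limit instead of starving the host. Setting the swap limit equal to the memory limit means the container gets no swap on top of its 8g of memory.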
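
As a rough sketch of how these constants might map onto docker create flags: the helper name build_create_args and the example image are hypothetical, while --pids-limit, --memory, --memory-swap, and --cpus are standard Docker CLI options.

```python
CONTAINER_PIDS_LIMIT = 256
CONTAINER_MEMORY_LIMIT = "8g"
CONTAINER_MEMORY_SWAP_LIMIT = "8g"  # equal to the memory limit: no extra swap
CONTAINER_CPU_LIMIT = 2.0


def build_create_args(name: str, image: str, cmd: list[str]) -> list[str]:
    """Hypothetical helper: assemble the argv for `docker create`."""
    return [
        "docker", "create",
        "--name", name,
        "--pids-limit", str(CONTAINER_PIDS_LIMIT),
        "--memory", CONTAINER_MEMORY_LIMIT,
        "--memory-swap", CONTAINER_MEMORY_SWAP_LIMIT,
        "--cpus", str(CONTAINER_CPU_LIMIT),
        image, *cmd,
    ]


# Example image and command are placeholders, not values from the incident.
args = build_create_args("inst-abc123", "python:3.12-slim", ["pytest", "-q"])
print(" ".join(args))
```

Building the argv as a list (rather than a shell string) avoids quoting problems when the task command contains spaces.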

2. Watchdog monitoring loop

The watchdog runs as a background asyncio task, polling every 5 minutes:

async def watchdog_loop(self, instance_store, interval=300):
    while True:
        for instance_id, info in list(self._active.items()):
            await self._watchdog_check_instance(instance_id, info, instance_store)
        await asyncio.sleep(interval)

async def _watchdog_check_instance(self, instance_id, info, instance_store):
    container_name = info.container_name

    # Check 1: Lifetime
    # Assumes info.started_at holds a time.time() timestamp and `time` is imported.
    age_seconds = time.time() - info.started_at
    if age_seconds > MAX_INSTANCE_LIFETIME_SECONDS:
        await self._watchdog_destroy(instance_id, container_name, instance_store)
        return

    # Check 2 & 3: PIDs and Memory (single docker stats call)
    rc, stdout, _ = await self._client._run_docker(
        "stats", "--no-stream", "--format", "{{.PIDs}} {{.MemPerc}}",
        container_name)
    parts = stdout.strip().split()
    if rc != 0 or len(parts) < 2:
        return  # container gone or stats unavailable; retry next cycle
    pid_count = int(parts[0])
    mem_perc = float(parts[1].rstrip("%"))

    if pid_count > _WATCHDOG_PID_THRESHOLD:     # 200
        await self._watchdog_destroy(instance_id, container_name, instance_store)
        return
    if mem_perc > _WATCHDOG_MEMORY_PERCENT_THRESHOLD:  # 80%
        await self._watchdog_destroy(instance_id, container_name, instance_store)
        return

Key design: the watchdog uses docker stats --no-stream with a format string to get both PID count and memory percentage in a single call. This minimizes Docker API overhead.
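
The parsing step can be isolated into a small pure function, which makes it easy to test without Docker. This is a minimal sketch; parse_stats_line is a hypothetical name, and the sample lines are made up to match the "{{.PIDs}} {{.MemPerc}}" format.

```python
from __future__ import annotations


def parse_stats_line(line: str) -> tuple[int, float] | None:
    """Parse one '<pids> <mem>%' line; return None if it is not usable."""
    parts = line.strip().split()
    if len(parts) < 2:
        return None  # empty output, e.g. the container no longer exists
    try:
        return int(parts[0]), float(parts[1].rstrip("%"))
    except ValueError:
        return None  # non-numeric placeholder output


print(parse_stats_line("212 83.41%\n"))
print(parse_stats_line(""))
```

Returning None instead of raising lets the watchdog skip the instance for this cycle and try again on the next one.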

The thresholds (_WATCHDOG_PID_THRESHOLD = 200, _WATCHDOG_MEMORY_PERCENT_THRESHOLD = 80.0) are below the hard limits (CONTAINER_PIDS_LIMIT = 256, CONTAINER_MEMORY_LIMIT = "8g"). This gives the watchdog a chance to detect and destroy a container before it reaches a hard limit, where new forks start failing (PID limit) or the kernel OOM-kills processes inside the container (memory limit).
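
To make the gap concrete, the headroom can be worked out from the numbers above. This sketch assumes MemPerc is reported relative to the container's --memory limit (as docker stats does when a limit is set) and that Docker reads "8g" as 8 GiB; the helper names are hypothetical.

```python
_UNITS = {"b": 1, "k": 1024, "m": 1024**2, "g": 1024**3}


def mem_limit_bytes(limit: str) -> int:
    """Convert a Docker-style size such as '8g' into bytes (binary units)."""
    suffix = limit[-1].lower()
    if suffix in _UNITS:
        return int(limit[:-1]) * _UNITS[suffix]
    return int(limit)  # plain number of bytes


def headroom(pids_limit: int, pid_threshold: int,
             mem_limit: str, mem_percent_threshold: float) -> dict:
    """How far below each hard limit the watchdog fires."""
    limit_bytes = mem_limit_bytes(mem_limit)
    free_bytes = limit_bytes * (100 - mem_percent_threshold) / 100
    return {
        "pid_headroom": pids_limit - pid_threshold,
        "mem_headroom_gib": round(free_bytes / 1024**3, 2),
    }


h = headroom(256, 200, "8g", 80.0)
print(h)
```

With these values the watchdog fires 56 PIDs and about 1.6 GiB before the hard limits. Because it polls every five minutes, a fast-growing workload can still cross that gap between checks, which is why the hard limits stay in place.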

3. Watchdog destroy — cleanup with sidecar awareness

Destroying a container also destroys its sidecar containers:

async def _watchdog_destroy(self, instance_id, container_name, instance_store):
    # Destroy the main container
    await self._client.destroy_container(container_name)
    # Destroy the sidecar (companion container) if your architecture uses one
    info = self._active.get(instance_id)
    if info is not None and info.sidecar_env_id is not None:
        await destroy_sidecar(info.sidecar_env_id)
    # Update status and remove from active tracking
    instance_store.update_instance(instance_id, status="cancelled")
    self._active.pop(instance_id, None)
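
One thing the snippet above leaves open is what happens if removing the main container raises: the sidecar would then be leaked. A best-effort variant, sketched here with stand-in coroutines instead of real Docker calls (destroy_instance and the fake helpers are hypothetical names), attempts every step and collects the failures.

```python
import asyncio


async def destroy_instance(instance_id, container_name, sidecar_id,
                           destroy_container, destroy_sidecar, active):
    """Best-effort teardown: attempt every step, then report what failed."""
    errors = []
    try:
        await destroy_container(container_name)
    except Exception as exc:  # keep going so the sidecar is not leaked
        errors.append(f"container {container_name}: {exc}")
    if sidecar_id is not None:
        try:
            await destroy_sidecar(sidecar_id)
        except Exception as exc:
            errors.append(f"sidecar {sidecar_id}: {exc}")
    active.pop(instance_id, None)  # stop tracking even if a step failed
    return errors


async def _demo():
    destroyed = []

    async def failing_destroy_container(name):
        raise RuntimeError("no such container")

    async def fake_destroy_sidecar(env_id):
        destroyed.append(env_id)

    active = {"i-1": object()}
    errors = await destroy_instance("i-1", "inst-i-1", "env-9",
                                    failing_destroy_container,
                                    fake_destroy_sidecar, active)
    return errors, destroyed, active


errors, destroyed, active = asyncio.run(_demo())
print(errors, destroyed, active)
```

Whether to drop the instance from tracking after a failed step is a design choice; keeping it tracked would let the next watchdog cycle retry the teardown.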

4. Orphan container recovery on service restart

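When a new orchestrator replaces an old one, containers that were already running must stay tracked, or the watchdog will never clean them up. The application lifespan handler therefore copies the active-instance table into the new orchestrator. The table lives in memory, so this covers in-process upgrades; after a hard crash it has to be rebuilt by listing containers from Docker. The lifespan handoff looks like this: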

old_orchestrator = getattr(app.state, "orchestrator", None)  # None on first startup
new_orchestrator = Orchestrator(client=client)
if old_orchestrator and hasattr(old_orchestrator, "_active"):
    new_orchestrator._active.update(old_orchestrator._active)
    logger.info("Preserved %d active instances during orchestrator upgrade",
                len(old_orchestrator._active))
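
For the crash case, where no old orchestrator object exists, one option is to reconcile against Docker itself. The sketch below assumes containers follow an "inst-" naming prefix (as in the starter template) and uses the standard docker ps --filter and --format options; the function names are hypothetical.

```python
import asyncio

CONTAINER_PREFIX = "inst-"  # assumed naming convention for managed containers


async def list_managed_containers(prefix: str = CONTAINER_PREFIX) -> set[str]:
    """Names of all containers (running or stopped) that carry the prefix."""
    proc = await asyncio.create_subprocess_exec(
        "docker", "ps", "-a",
        "--filter", f"name={prefix}",
        "--format", "{{.Names}}",
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.DEVNULL)
    stdout, _ = await proc.communicate()
    # The name filter matches substrings, so re-check the prefix explicitly.
    return {n for n in stdout.decode().split() if n.startswith(prefix)}


def find_orphans(existing: set[str], tracked: set[str]) -> set[str]:
    """Containers Docker knows about that the orchestrator is not tracking."""
    return existing - tracked


# Pure-logic example; no Docker daemon is needed for this part.
orphans = find_orphans({"inst-aaa", "inst-bbb"}, {"inst-aaa"})
print(orphans)
```

On startup the orchestrator could call list_managed_containers(), compare the result with its tracked names, and then either destroy the orphans or re-adopt them into the active table.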

5. Sidecar provisioning — persistent data via bind mounts

Each task instance can get a dedicated sidecar container (e.g., a database, cache, or auxiliary API):

async def create_sidecar_for_instance(instance_id: str) -> SidecarInfo | None:
    """Create a sidecar with data bind-mounted to the instance directory."""
    host_data_path = get_instance_dir(instance_id) / SIDECAR_DATA_DIR
    host_data_path.mkdir(parents=True, exist_ok=True)
    return await asyncio.to_thread(_create_sidecar_sync, docker_network,
                                    str(host_data_path))

The bind mount at /data means sidecar data (repos, databases) survives container destruction — it persists in the instance directory.
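
A sketch of how the mount itself might be expressed when creating the sidecar: it uses Docker's --mount type=bind syntax with /data as the target, as described above. The helper name and the "sidecar-data" directory name are placeholders, since the value of SIDECAR_DATA_DIR is not shown.

```python
import tempfile
from pathlib import Path


def sidecar_mount_args(instance_dir: Path,
                       data_dir_name: str = "sidecar-data") -> list[str]:
    """Create the host data directory and return the bind-mount flags."""
    host_path = (instance_dir / data_dir_name).resolve()
    host_path.mkdir(parents=True, exist_ok=True)
    # Bind mounts need an absolute host path, hence resolve() above.
    return ["--mount", f"type=bind,source={host_path},target=/data"]


with tempfile.TemporaryDirectory() as tmp:
    mount_args = sidecar_mount_args(Path(tmp) / "instance-1")
    created = (Path(tmp) / "instance-1" / "sidecar-data").is_dir()
print(mount_args)
```

The returned flags would be spliced into the docker create (or docker run) argv for the sidecar; the directory outlives the container because it belongs to the host filesystem.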

Network-aware URLs handle the container-vs-host split:

url = "http://sidecar:3000" if docker_network else f"http://host.docker.internal:{port}"

When containers share a Docker network, they reach the sidecar via DNS name. When not on a shared network (dev mode), they use host.docker.internal to reach the host-mapped port.

Template / Starter Code

# orchestrator.py — container lifecycle with safety limits and watchdog
import asyncio
import logging
import time

PIDS_LIMIT = 256
MEMORY_LIMIT = "8g"
CPU_LIMIT = 2.0
MAX_LIFETIME = 12 * 60 * 60  # 12 hours
WATCHDOG_PID_THRESHOLD = 200
WATCHDOG_MEM_THRESHOLD = 80.0

logger = logging.getLogger(__name__)

class Orchestrator:
    def __init__(self):
        self._active: dict[str, dict] = {}  # instance_id → {container, started_at}

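    async def start_instance(self, instance_id: str, image: str, cmd: list[str]):
        container_name = f"inst-{instance_id[:12]}"
        proc = await asyncio.create_subprocess_exec(
            "docker", "create",
            "--name", container_name,
            "--pids-limit", str(PIDS_LIMIT),
            "--memory", MEMORY_LIMIT,
            "--memory-swap", MEMORY_LIMIT,  # equal to --memory: no extra swap
            "--cpus", str(CPU_LIMIT),
            image, *cmd,
            stdout=asyncio.subprocess.PIPE, stderr=asyncio.subprocess.PIPE)
        _, stderr = await proc.communicate()
        if proc.returncode != 0:
            raise RuntimeError(f"docker create failed: {stderr.decode().strip()}")
        # Wait for `docker start` to finish and check that it succeeded
        start = await asyncio.create_subprocess_exec(
            "docker", "start", container_name,
            stdout=asyncio.subprocess.DEVNULL, stderr=asyncio.subprocess.PIPE)
        _, stderr = await start.communicate()
        if start.returncode != 0:
            raise RuntimeError(f"docker start failed: {stderr.decode().strip()}")
        # Track the instance only once the container is actually running
        self._active[instance_id] = {
            "container": container_name,
            "started_at": time.time(),
        }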

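    async def watchdog_loop(self, interval: float = 300):
        while True:
            for iid, info in list(self._active.items()):
                try:
                    await self._check(iid, info)
                except Exception:
                    # One failing check must not stop monitoring of the rest
                    logger.exception("Watchdog check failed for %s", iid)
            await asyncio.sleep(interval)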

    async def _check(self, instance_id: str, info: dict):
        container = info["container"]
        age = time.time() - info["started_at"]
        if age > MAX_LIFETIME:
            logger.warning("Lifetime exceeded for %s", instance_id)
            await self._destroy(instance_id, container)
            return
        proc = await asyncio.create_subprocess_exec(
            "docker", "stats", "--no-stream", "--format", "{{.PIDs}} {{.MemPerc}}",
            container, stdout=asyncio.subprocess.PIPE,
            stderr=asyncio.subprocess.DEVNULL)
        stdout, _ = await proc.communicate()
        parts = stdout.decode().strip().split()
        if proc.returncode != 0 or len(parts) < 2:
            return  # container gone or stats unavailable; retry next cycle
        try:
            pids = int(parts[0])
            mem = float(parts[1].rstrip("%"))
        except ValueError:
            return  # non-numeric output; retry next cycle
        if pids > WATCHDOG_PID_THRESHOLD or mem > WATCHDOG_MEM_THRESHOLD:
            logger.warning("Resource violation: %s (pids=%d, mem=%.1f%%)",
                           instance_id, pids, mem)
            await self._destroy(instance_id, container)

    async def _destroy(self, instance_id: str, container: str):
        proc = await asyncio.create_subprocess_exec(
            "docker", "rm", "-f", container)
        await proc.communicate()
        self._active.pop(instance_id, None)
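
The template does not show how the watchdog is started and stopped. A minimal sketch of the background-task lifecycle follows, using a stand-in periodic loop so it runs without Docker; in real use, Orchestrator.watchdog_loop() would take the place of periodic(...). The names periodic and main are illustrative.

```python
import asyncio
import contextlib


async def periodic(check, interval: float):
    """Stand-in for watchdog_loop: run `check` forever, sleeping in between."""
    while True:
        await check()
        await asyncio.sleep(interval)


async def main() -> int:
    calls = 0

    async def check():
        nonlocal calls
        calls += 1

    # Start the loop in the background, as the service would at startup.
    task = asyncio.create_task(periodic(check, interval=0.01))
    await asyncio.sleep(0.05)  # the service does its normal work here

    # On shutdown: cancel the task and wait for it to unwind.
    task.cancel()
    with contextlib.suppress(asyncio.CancelledError):
        await task
    return calls


calls = asyncio.run(main())
print("checks run:", calls)
```

Keeping a reference to the task matters: asyncio holds only weak references to tasks, so an unreferenced background task can be garbage-collected mid-run.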

Gotchas & Lessons Learned

The runaway processes incident. Before PID limits existed, an automated agent ran a test suite in a loop. Each test process forked subprocesses. The container had no --pids-limit, so the cascade consumed over 100Gi of RAM and OOM-killed other workloads on the host. The fix was twofold: hard Docker --pids-limit=256 AND a software process guard that kills orphan test patterns between commands.
Watchdog thresholds must be below hard limits. The watchdog threshold for PIDs (200) is below the Docker limit (256). If the watchdog only fired at 256, the container might already be stuck in a fork bomb where new processes can't spawn but existing ones consume resources. The gap gives the watchdog a window to act.
docker stats --no-stream is the simplest monitoring. A single docker stats call returns PIDs and memory in one shot. The format string {{.PIDs}} {{.MemPerc}} extracts just what we need. Alternatives (reading cgroup files, calling the Docker API directly) add complexity that a five-minute polling interval does not justify.
Sidecar data must be bind-mounted for persistence. Without the bind mount, destroying the sidecar container destroys all data created during the instance's run. The bind mount to the instance directory means data survives even after the sidecar is cleaned up.
The host.docker.internal vs Docker network split. In dev mode (no shared Docker network), containers reach host services via host.docker.internal. In production (shared network), they use container DNS names. Your service abstraction should handle this with separate host and container_host fields.
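
The last point can be sketched as a small value type with separate host-side and container-side addresses. The class and function names are hypothetical; the "sidecar" DNS name, port 3000, and host.docker.internal come from the URL logic shown earlier, and "localhost" for the host side is an assumption.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceEndpoint:
    host: str             # address to use from the host machine
    port: int             # host-mapped port
    container_host: str   # address to use from inside a task container
    container_port: int   # port to use from inside a task container

    def url(self, from_container: bool) -> str:
        if from_container:
            return f"http://{self.container_host}:{self.container_port}"
        return f"http://{self.host}:{self.port}"


def sidecar_endpoint(docker_network: str | None, host_port: int) -> ServiceEndpoint:
    """Pick container-side addressing based on whether a network is shared."""
    if docker_network:
        # Shared network: reach the sidecar by DNS name on its internal port.
        return ServiceEndpoint("localhost", host_port, "sidecar", 3000)
    # Dev mode: go through the host-mapped port via host.docker.internal.
    return ServiceEndpoint("localhost", host_port, "host.docker.internal", host_port)


prod = sidecar_endpoint("tasks-net", 13000)
dev = sidecar_endpoint(None, 13000)
print(prod.url(from_container=True), dev.url(from_container=True))
```

On Linux hosts, host.docker.internal typically resolves only if the container is started with --add-host=host.docker.internal:host-gateway, so dev mode may need that flag as well.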
