skills/telegram/SKILL.md
Use Telegram as the agent's I/O channel. Once triggered, the agent enters a REPL state — reading instructions from TG, executing them, printing results back to TG, and looping. Use when the user asks to be notified, messaged, or alerted via Telegram, or wants to interact with the agent through TG. This is a cross-cutting skill — other skills (batch-sweep, model-config, job-triage) can trigger it when the user explicitly requests it.
Use Telegram as the agent's I/O channel via utils/telegram_bot.sh. Once activated, the agent enters a REPL state: all results go to TG, all instructions come from TG, and the loop runs until timeout or the user explicitly exits. Trigger phrases: "send me a TG message", "notify me when done", "alert me on Telegram", "wait for my TG reply".
NEVER read, print, or log the contents of ~/.tg_config. Only check if the file exists. The script loads credentials internally.
If the user specifies a bot profile (e.g., "use the alerts channel"), pass -b <name> on every send/recv call for the rest of the session. The choice is sticky — do not revert to default until the user explicitly asks to switch.
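The sticky-profile rule can be sketched as a small helper that remembers the chosen profile and adds -b <name> to every later call. This is a minimal illustration: `TelegramSession`, `use_profile`, and `command` are hypothetical names, and only the script path and the -b flag come from this document.

```python
from typing import List, Optional

SCRIPT = "utils/telegram_bot.sh"  # path as used in this document


class TelegramSession:
    """Hypothetical helper: remembers the bot profile for the rest of the session."""

    def __init__(self) -> None:
        self.profile: Optional[str] = None  # None means the default bot

    def use_profile(self, name: Optional[str]) -> None:
        # Call this only when the user explicitly asks to switch profiles.
        self.profile = name

    def command(self, action: str, *args: str) -> List[str]:
        # Build the argv for a send/recv call, adding -b <name> when a profile is set.
        prefix = [SCRIPT] + (["-b", self.profile] if self.profile else [])
        return prefix + [action, *args]


session = TelegramSession()
session.use_profile("alerts")
print(session.command("send", "Sweep complete"))
print(session.command("recv", "--timeout", "600"))
```

Keeping the profile in one place means no individual send or recv call can fall back to the default bot by accident.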
Try the following steps in order. Stop at the first success.

Step 1. Check whether credentials exist in the container:
test -f ~/.tg_config && echo "EXISTS" || echo "NOT FOUND"
If ~/.tg_config exists, send directly:
utils/telegram_bot.sh send "Your message here"
utils/telegram_bot.sh -b alerts send "Your message here" # named profile
For multi-line messages, pipe from stdin:
echo "Line 1
Line 2" | utils/telegram_bot.sh send
If this succeeds, done. If ~/.tg_config doesn't exist, go to Step 2.
Step 2. Check if host-cmd is available:
python3 /maxtext-slurm/.host-cmd/host_cmd.py --ping --timeout 5
If it does NOT respond ALIVE, go to Step 3.
If alive, check host credentials exist (never read or print the file contents):
python3 /maxtext-slurm/.host-cmd/host_cmd.py --timeout 10 "test -f ~/.tg_config && echo EXISTS || echo NOT_FOUND"
If credentials exist, send using the base64-to-file pattern. This is the safest host-cmd path because direct quoting through host-cmd breaks on special characters, and even a heredoc embedded inside a double-quoted host-cmd argument can be mangled by the local shell before the command reaches the host.
# Encode the message locally so Markdown backticks, $(), quotes, and newlines survive.
ENCODED=$(python3 - <<'PY'
import base64
msg = """Your message here.
Multiple lines are fine.
Markdown `backticks`, $(subshells), and emoji are safe.
"""
print(base64.b64encode(msg.encode()).decode())
PY
)
# Decode into a temp file on the host
python3 /maxtext-slurm/.host-cmd/host_cmd.py --timeout 10 \
"python3 -c \"import base64; open('/tmp/tg_msg.txt','w').write(base64.b64decode('$ENCODED').decode())\""
# Pipe the file into telegram_bot.sh
python3 /maxtext-slurm/.host-cmd/host_cmd.py --timeout 15 \
"cat /tmp/tg_msg.txt | bash utils/telegram_bot.sh send"
To use a named profile via host-cmd: bash utils/telegram_bot.sh -b alerts send
If this succeeds, done. If credentials don't exist on the host either, go to Step 3.
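The fallback order (container first, then host-cmd, then a report to the user) can be sketched as a pure decision function. The function name and return labels are hypothetical. The three booleans stand for the results of the test -f check in the container, the --ping check, and the test -f check on the host.

```python
def choose_send_path(container_has_config: bool,
                     host_cmd_alive: bool,
                     host_has_config: bool) -> str:
    """Return which path to use: 'container', 'host-cmd', or 'report-failure'."""
    if container_has_config:
        return "container"          # send directly from the container
    if host_cmd_alive and host_has_config:
        return "host-cmd"           # send through host-cmd with the base64-to-file pattern
    return "report-failure"         # tell the user what failed and offer setup help


print(choose_send_path(False, True, True))
```

The function only inspects whether the file exists, in line with the rule that credentials are never read or printed.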
Step 3. Tell the user what failed and give minimal next steps. Example:
Could not send Telegram message — no credentials found.
~/.tg_config is missing in both the container and the host. Want me to help set up a Telegram bot? (takes ~2 minutes)
Or if host-cmd is unavailable:
Could not send Telegram message —
~/.tg_config not found in the container, and host-cmd is not available. Want me to help set up Telegram credentials locally?
If the user says yes, walk them through docs/notifications.md setup:
1. In Telegram, send /newbot to @BotFather → get bot token
2. Send any message to the new bot, then open https://api.telegram.org/bot<TOKEN>/getUpdates → get chat ID
3. Create ~/.tg_config with install -m 600 /dev/null ~/.tg_config, then add:
BotToken <token>
ChatID <chat_id>
4. Test: utils/telegram_bot.sh send "Hello from $(hostname)"

| Problem | Cause | Fix |
|---------|-------|-----|
| syntax error near unexpected token / command not found before host-cmd runs | Local shell expanded backticks or $() inside the message | Use the base64-to-file pattern above, not inline text or a heredoc embedded in a double-quoted host-cmd argument |
| command not found on host | utils/ path is relative to repo root | host-cmd cwd is already the repo root; bash utils/telegram_bot.sh works |
| Empty message error | Heredoc EOF marker was indented | Use unindented EOF marker |
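The fix in the first row works because standard base64 output contains only letters, digits, "+", "/", and "=", none of which the shell expands inside a quoted argument. A small check of that property, using a made-up sample message:

```python
import base64
import string

# Sample text with the characters that usually break shell quoting.
message = "Markdown `backticks`, $(subshells), \"quotes\",\nnewlines and emoji \U0001F680"
encoded = base64.b64encode(message.encode("utf-8")).decode("ascii")

# Standard base64 uses only letters, digits, '+', '/', and '=' padding.
safe_alphabet = set(string.ascii_letters + string.digits + "+/=")
assert set(encoded) <= safe_alphabet

# Decoding on the receiving side restores the original text exactly.
decoded = base64.b64decode(encoded).decode("utf-8")
assert decoded == message
print(encoded)
```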
Keep messages concise. Telegram has a 4096-char limit per message (the script auto-splits longer messages). Use Markdown formatting — the script sends with PARSE_MODE=Markdown by default.
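To illustrate what splitting at the 4096-character limit involves, here is a minimal sketch that prefers newline boundaries. It is not the script's actual algorithm, which this document does not describe. `split_message` is a hypothetical name, and Python's `len` is only an approximation of how Telegram counts characters.

```python
from typing import List

TELEGRAM_LIMIT = 4096  # per-message limit stated in this document


def split_message(text: str, limit: int = TELEGRAM_LIMIT) -> List[str]:
    """Split text into chunks of at most `limit` characters, preferring newline boundaries."""
    chunks: List[str] = []
    while len(text) > limit:
        # Last newline that still keeps the chunk within the limit.
        cut = text.rfind("\n", 0, limit + 1)
        if cut <= 0:
            # No usable newline: fall back to a hard cut at the limit.
            chunks.append(text[:limit])
            text = text[limit:]
        else:
            chunks.append(text[:cut])
            text = text[cut + 1:]  # drop the newline used as the split point
    if text:
        chunks.append(text)
    return chunks


print([len(part) for part in split_message("a" * 5000)])
```

Even with auto-splitting available, a long result usually reads better as a short summary plus the key numbers.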
Wrap technical identifiers in backtick code spans — file paths, config names, variable names, model names. Telegram does not parse Markdown inside backticks, so underscores and brackets are safe:
*Sweep complete*
`config_llama_70b_batch_8`: 1250 TGS
`config_llama_70b_batch_16`: 892 TGS
_Best result_: `per_device_batch_size`=4
This is the key rule for avoiding Markdown conflicts. *bold*, _italic_, and `code` all work naturally as long as dynamic content with underscores goes in backticks.
If Markdown still fails (e.g., unmatched _ the agent missed), the script automatically retries as plain text — the message always gets delivered.
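The backtick rule can be applied mechanically when a message is built from dynamic values. The sketch below assumes a hypothetical `results` mapping of config names to TGS numbers. `code_span` and `sweep_summary` are illustrative names, not part of the script.

```python
from typing import Mapping


def code_span(value: object) -> str:
    """Wrap a dynamic value in backticks so underscores are not parsed as Markdown."""
    text = str(value)
    if "`" in text:
        # A backtick inside the value would close the code span early.
        raise ValueError("value contains a backtick and cannot go in a code span as-is")
    return f"`{text}`"


def sweep_summary(results: Mapping[str, float]) -> str:
    """Build a short Markdown summary with every config name in a code span."""
    lines = ["*Sweep complete*"]
    for name, tgs in results.items():
        lines.append(f"{code_span(name)}: {tgs} TGS")
    return "\n".join(lines)


print(sweep_summary({"config_llama_70b_batch_8": 1250, "config_llama_70b_batch_16": 892}))
```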
For multi-line code or shell output, use triple-backtick code blocks:
```
step 100: loss=2.31, TGS=1250.3
step 200: loss=2.15, TGS=1248.7
```
Once the agent sends its first TG result, it enters REPL mode — Telegram becomes the I/O channel. The agent reads instructions from TG, executes them, prints results back to TG, and loops. Only two things exit REPL mode: (1) recv timeout, or (2) the user explicitly asks to stop listening. Everything else — results, acknowledgments, errors, clarifications — feeds back into the loop.
Critical invariant: the REPL belongs to the assistant, not to any single shell command. A recv invocation is only one blocking wait inside the loop. While REPL mode is active, the assistant must keep the turn alive: do NOT send a final response, do NOT end the turn, and do NOT treat a completed recv process as the end of the REPL. A recv exit code of 0 means a Telegram message arrived and the loop must continue.
Print — send the result (if any), then send the prompt. The result uses Markdown (wrap technical identifiers in backticks — see Message formatting above). If looping back after a bare acknowledgment ("ok", "thanks"), skip the result and send only the prompt. The prompt is its own message:
━━━━━━━━━━━━━━━━━━━━ 💬 *Awaiting further instructions* ⏳ _Timeout: {duration}_ ━━━━━━━━━━━━━━━━━━━━
Replace {duration} with the actual timeout in the most natural unit (e.g., "10 minutes", "1 hour", "2 hours"). The *...* renders as bold and _..._ renders as italic in Telegram. The ━ line, 💬, and ⏳ are literal characters. Send this prompt at every loop-back point (after results, after acks) — NOT after echo messages (step 3) or progress reports (step 8).
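One way to pick "the most natural unit" for {duration} is to use the largest unit that divides the timeout evenly. The helper below is a sketch under that assumption, and `format_timeout` is a hypothetical name.

```python
def format_timeout(seconds: int) -> str:
    """Render a timeout in the largest unit that divides it evenly."""
    for unit, size in (("hour", 3600), ("minute", 60)):
        if seconds >= size and seconds % size == 0:
            count = seconds // size
            return f"{count} {unit}" + ("" if count == 1 else "s")
    return f"{seconds} second" + ("" if seconds == 1 else "s")


for value in (600, 3600, 7200, 5400):
    print(value, "->", format_timeout(value))
```

With this rule a 5400-second timeout renders as "90 minutes" rather than a fractional hour.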
Run recv to wait for the user's input. Background it immediately so you can poll.

From the container:
utils/telegram_bot.sh recv --timeout 600
Via host-cmd (set host-cmd timeout slightly above recv timeout):
python3 /maxtext-slurm/.host-cmd/host_cmd.py --timeout 660 \
"bash utils/telegram_bot.sh recv --timeout 600"
Run with block_until_ms: 0 to background it, then poll the terminal file every ~30 seconds.
After starting recv, keep monitoring that background command until it either times out or returns a message. The assistant remains responsible for the REPL state the entire time.
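The two recv invocations differ only in the host-cmd wrapper and its slightly longer timeout. The sketch below builds both argument lists. The 60-second margin is an assumption taken from the 600 and 660 values in the example, and `recv_command` is a hypothetical helper.

```python
from typing import List

HOST_CMD = "/maxtext-slurm/.host-cmd/host_cmd.py"  # path as used in this document
HOST_MARGIN_S = 60  # assumption: mirrors the 600 -> 660 example


def recv_command(timeout_s: int, via_host: bool = False) -> List[str]:
    """Build the argv for a recv call, directly or wrapped in host-cmd."""
    recv = f"utils/telegram_bot.sh recv --timeout {timeout_s}"
    if not via_host:
        return recv.split()
    # host-cmd must outlive recv, so its timeout is recv's timeout plus a margin.
    return ["python3", HOST_CMD, "--timeout", str(timeout_s + HOST_MARGIN_S), f"bash {recv}"]


print(recv_command(600))
print(recv_command(600, via_host=True))
```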
Eval — on reply, read the user's instructions from the command output.
If the recv process exited successfully (exit_code: 0), that is NOT an exit condition. It means new Telegram input is ready to evaluate. Read the output, handle the message, send the result and prompt as needed, then start the next recv.
Echo before executing — immediately send a short TG message paraphrasing what you understood and that you're starting work. This confirms receipt, lets the user catch misinterpretations early, and sets expectations for longer tasks. Example: "Got it: re-run the sweep with Y=5. Working on it..."
Multi-message handling — if multiple messages arrived at once (newline-separated in the output), read them all before doing anything. Classify them:
In all cases, the echo message should reflect your interpretation of every received message so the user sees exactly what you plan to do.
Peek before executing — after sending the echo, do a quick recv --timeout 1 before starting work. This catches last-second corrections the user sent after recv returned but before the echo was delivered (e.g., "wait, actually use Y=10"). If a correction is found, incorporate it, send a new echo reflecting the updated plan, and peek again (the user may send further adjustments after seeing the new echo). Only start execution once a peek comes back empty. Classify any found message using the same rules as mid-task peek (step 8).
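The echo-then-peek cycle is a loop that ends only when a peek comes back empty. The sketch below injects `send` and `peek` as plain callables so it can run without Telegram. Both names, and the wording of the follow-up echo, are hypothetical, and for brevity every peeked message is treated as a correction rather than being classified.

```python
from typing import Callable, List, Optional


def settle_plan(first_instruction: str,
                send: Callable[[str], None],
                peek: Callable[[], Optional[str]]) -> List[str]:
    """Echo the plan, then peek until no correction arrives; return all messages to act on."""
    messages = [first_instruction]
    send(f"Got it: {first_instruction}. Working on it...")
    while True:
        correction = peek()            # stands in for `recv --timeout 1`
        if correction is None:
            return messages            # empty peek: safe to start executing
        messages.append(correction)
        send(f"Updated plan: {correction}. Working on it...")  # new echo, then peek again


# Usage with fakes: one late correction, then an empty peek.
sent: List[str] = []
pending = iter(["wait, actually use Y=10", None])
plan = settle_plan("re-run the sweep with Y=5", sent.append, lambda: next(pending))
print(plan)
```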
After executing, send the result. If execution failed or errored, the error IS the result — send it as such. Always loop back to step 1 (prompt + recv). No exceptions — the loop is a REPL.
On timeout (recv exits with code 1, output contains "Timeout"): send a final TG message so the user knows the agent stopped listening, then end the loop. Report in the agent chat as well:
TG message:
━━━━━━━━━━━━━━━━━━━━ ⏱ *Timed out — no longer listening* ⚠️ _Back to the agent chat to continue_ ━━━━━━━━━━━━━━━━━━━━
Agent chat: "TG interactive loop ended — no reply within timeout."
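The recv outcomes described so far reduce to a small mapping from exit code and output to the next action. In this sketch the action labels and the function name are hypothetical, and treating any other exit status as an error to report is an assumption, since this document only defines codes 0 and 1.

```python
def interpret_recv(exit_code: int, output: str) -> str:
    """Map a finished recv process to the next REPL action."""
    if exit_code == 0:
        return "handle-message"    # new input arrived; the loop continues
    if exit_code == 1 and "Timeout" in output:
        return "timeout-exit"      # send the timed-out notice, then end the loop
    return "report-error"          # assumption: anything else is surfaced as an error


print(interpret_recv(0, "now run it again with Y=5"))
print(interpret_recv(1, "Timeout"))
```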
Exit (explicit only): end the REPL ONLY if the user explicitly asks to stop listening. Exit phrases: "stop listening", "exit loop", "done with TG", "that's all". Bare acknowledgments ("ok", "thanks", "done", "got it") are NOT exit signals — they mean the user saw the result; loop back to step 1. This rule applies everywhere: at the recv boundary AND during mid-task peek (step 8). When exiting, send a TG confirmation and report in the agent chat:
TG message:
━━━━━━━━━━━━━━━━━━━━ ✅ *Acknowledged — no longer listening* ⚠️ _Back to the agent chat to continue_ ━━━━━━━━━━━━━━━━━━━━
Agent chat: "TG interactive loop ended — user acknowledged."
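The distinction between exit phrases and bare acknowledgments can be sketched as a three-way classifier. The phrase lists come from this section. The substring matching, the normalization, and the name `classify_reply` are simplifying assumptions, and an agent reading natural language would apply more judgment than this.

```python
EXIT_PHRASES = ("stop listening", "exit loop", "done with tg", "that's all")
ACKS = {"ok", "thanks", "done", "got it"}


def classify_reply(text: str) -> str:
    """Return 'exit', 'ack', or 'instruction' for one Telegram reply."""
    normalized = text.strip().lower().rstrip(".!")
    # Check exit phrases first so "done with TG" is not mistaken for the ack "done".
    if any(phrase in normalized for phrase in EXIT_PHRASES):
        return "exit"
    if normalized in ACKS:
        return "ack"
    return "instruction"


for reply in ("thanks", "done with TG", "stop", "now run it again with Y=5"):
    print(repr(reply), "->", classify_reply(reply))
```

A bare "stop" is classified as an instruction, so it halts the current task without leaving the loop.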
Ad-hoc timeout changes: if the user asks to change the wait time (e.g. "increase timeout to 20 min"), adjust the --timeout flag on subsequent recv calls in the current session. Do NOT modify the script's default timeout or any docs/config — just pass --timeout 1200 (or whatever the user requests) for the remainder of the loop.
One loop at a time: Telegram's getUpdates API is per-bot — any recv that confirms a message purges it from the queue for ALL consumers. Do NOT run interactive loops in multiple concurrent sessions with the same bot token. Only one session should use recv at a time; other sessions can still send.
Progress reports and mid-task peek: while executing a user's instruction (between the echo and the result message), send intermediate progress updates for tasks that take more than a few minutes. Two strategies, not mutually exclusive:
Anti-spam cap: regardless of strategy, send no more than 2 progress messages per hour. This forces selectivity — if many milestones occur in a short window, batch or skip most of them.
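The cap of 2 progress messages per hour can be enforced with a small sliding-window limiter. Whether the hour is a sliding window or a fixed clock hour is not specified here, so the sliding window is an assumption, and `ProgressLimiter` is a hypothetical name.

```python
from collections import deque
from typing import Deque


class ProgressLimiter:
    """Allow at most `max_sends` progress messages per sliding `window_s` seconds."""

    def __init__(self, max_sends: int = 2, window_s: float = 3600.0) -> None:
        self.max_sends = max_sends
        self.window_s = window_s
        self._sent: Deque[float] = deque()  # timestamps of recent sends

    def allow(self, now: float) -> bool:
        # Forget sends that have aged out of the window.
        while self._sent and now - self._sent[0] >= self.window_s:
            self._sent.popleft()
        if len(self._sent) < self.max_sends:
            self._sent.append(now)
            return True
        return False


limiter = ProgressLimiter()
print([limiter.allow(t) for t in (0, 600, 1200, 3600, 3700)])
```

When `allow` returns False, the milestone is skipped or folded into the next permitted progress message.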
Peek for instructions: after each progress send, do a quick non-blocking recv --timeout 1 to check if the user sent a message while the agent was working. The script's Phase 1 checks pending messages instantly, so this adds ~1s from the container or ~2-3s via host-cmd (due to round-trip overhead). If nothing is pending, continue working. If a message is found, classify it:
recv --timeout 120 to wait for the answer. If the user clarifies, act on it (stop, adjust, or queue). If it times out, resume the original execution as-is and note in the next progress send that the earlier message was unclear.

Progress sends and peek checks do NOT include the prompt; only result messages at the end of a task (step 1) get the prompt.
Agent: runs task, gets result
Agent: telegram_bot.sh send "*Task complete.* Result: `X`." (result)
Agent: telegram_bot.sh send "━━━...💬 *Awaiting*...⏳ _10 minutes_...━━━" (prompt)
Agent: telegram_bot.sh recv --timeout 600 (backgrounded, polls terminal)
User (on TG): "now run it again with Y=5"
Agent: reads reply from terminal output
Agent: telegram_bot.sh send "Got it: re-run with `Y`=5. Working on it..." (echo — no prompt)
Agent: telegram_bot.sh recv --timeout 1 (pre-exec peek — nothing)
Agent: starts long execution...
Agent: telegram_bot.sh send "Progress: 3/10 configs done..." (progress)
Agent: telegram_bot.sh recv --timeout 1 (peek — nothing)
Agent: continues working...
Agent: telegram_bot.sh send "Progress: 6/10 configs done..." (progress)
Agent: telegram_bot.sh recv --timeout 1 (peek — message found!)
User (on TG): "stop"
Agent: stops current execution
Agent: telegram_bot.sh send "*Stopped.* 6/10 configs completed: ..." (result)
Agent: telegram_bot.sh send "━━━...💬 *Awaiting*...⏳ _10 minutes_...━━━" (prompt)
Agent: telegram_bot.sh recv --timeout 600 (loop continues — waiting)
User (on TG): "ok run the remaining 4 with Z=3"
Agent: telegram_bot.sh send "Got it: configs 7-10 with `Z`=3..." (echo)
Agent: executes...
Agent: telegram_bot.sh send "*Done.* Remaining 4 configs complete: ..." (result)
Agent: telegram_bot.sh send "━━━...💬 *Awaiting*...⏳ _10 minutes_...━━━" (prompt)
Agent: telegram_bot.sh recv --timeout 600 (waiting again)
User (on TG): "thanks"
Agent: telegram_bot.sh send "━━━...💬 *Awaiting*...⏳ _10 minutes_...━━━" (ack → prompt only, loop back)
Agent: telegram_bot.sh recv --timeout 600 (still listening)
User (on TG): "stop listening"
Agent: telegram_bot.sh send "━━━...✅ *Acknowledged*...━━━" (explicit exit)
Agent: "TG interactive loop ended — user requested."
This skill is opt-in. Only activate when the user explicitly asks for Telegram messaging. Any TG send from another skill enters the REPL — the agent sends the result, then enters REPL mode (step 1 → recv → loop). There are no fire-and-forget sends once TG is active.
Once TG REPL mode is active, the assistant must not return control to the normal agent chat until the REPL exits by timeout or explicit user request.
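The invariant that control stays inside the loop until timeout or explicit exit can be sketched as a single function that returns only in those two cases. Everything here is illustrative: `send`, `recv`, `execute`, and `classify` are injected callables, the prompt and notice strings are placeholders for the full messages, any non-zero recv exit code is treated as a timeout, and the pre-execution peek and progress reports are omitted.

```python
from typing import Callable, List, Optional, Tuple

PROMPT = "Awaiting further instructions"  # placeholder for the full prompt message


def run_repl(first_result: str,
             send: Callable[[str], None],
             recv: Callable[[], Tuple[int, str]],
             execute: Callable[[str], str],
             classify: Callable[[str], str]) -> str:
    """Run the loop until timeout or explicit exit; return the reason it ended."""
    result: Optional[str] = first_result
    while True:
        if result is not None:
            send(result)                  # Print: the result goes first
        send(PROMPT)                      # the prompt is its own message
        exit_code, text = recv()          # blocks until a message or timeout
        if exit_code != 0:
            send("Timed out")             # placeholder for the timed-out notice
            return "timeout"
        kind = classify(text)             # expected: 'exit', 'ack', or 'instruction'
        if kind == "exit":
            send("Acknowledged")          # placeholder for the exit confirmation
            return "user-exit"
        if kind == "ack":
            result = None                 # bare ack: prompt only on the next pass
            continue
        send(f"Got it: {text}. Working on it...")  # echo before executing
        try:
            result = execute(text)        # Eval
        except Exception as exc:          # a failure is still a result to send
            result = f"Error: {exc}"


# Usage with fakes standing in for Telegram and for real work.
replies = iter([(0, "run again with Y=5"), (0, "thanks"), (0, "stop listening")])
sent: List[str] = []
kinds = {"thanks": "ack", "stop listening": "exit"}
reason = run_repl("Task complete", sent.append, lambda: next(replies),
                  lambda t: "Done: " + t, lambda t: kinds.get(t, "instruction"))
print(reason, sent)
```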
Typical integration points:
Do not proactively send messages unless the user requested them.