workspace/skills/pyats-health-check/SKILL.md
Comprehensive network device health monitoring - CPU, memory, interfaces, hardware, NTP, logging, environment, and uptime analysis. Use when running a device health check, monitoring CPU or memory usage, checking interface errors, or validating NTP sync.
Always run health checks in this exact order. Each section builds on the previous one.
Run show version to establish baseline identity.
PYATS_TESTBED_PATH=$PYATS_TESTBED_PATH python3 $MCP_CALL "python3 -u $PYATS_MCP_SCRIPT" pyats_run_show_command '{"device_name":"R1","command":"show version"}'
Extract and report:
Thresholds:
PYATS_TESTBED_PATH=$PYATS_TESTBED_PATH python3 $MCP_CALL "python3 -u $PYATS_MCP_SCRIPT" pyats_run_show_command '{"device_name":"R1","command":"show processes cpu sorted"}'
Thresholds (5-second / 1-minute / 5-minute averages):
Above 90% → CRITICAL: immediate investigation required
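As a rough sketch of how the CPU averages could be pulled out of the command output and graded, the snippet below parses the summary line that IOS and IOS-XE print at the top of `show processes cpu sorted`. The function names are hypothetical. Only the 90% CRITICAL level comes from this document; the `warning` default of 75 is an assumed placeholder to replace with your own threshold.

```python
import re

# Summary line printed at the top of "show processes cpu sorted", e.g.
# "CPU utilization for five seconds: 5%/0%; one minute: 4%; five minutes: 3%"
CPU_LINE = re.compile(
    r"CPU utilization for five seconds: (\d+)%/(\d+)%; "
    r"one minute: (\d+)%; five minutes: (\d+)%"
)


def parse_cpu(output: str) -> dict:
    """Return the 5-second, 1-minute and 5-minute CPU averages as integers."""
    match = CPU_LINE.search(output)
    if match is None:
        raise ValueError("CPU utilization line not found")
    five_sec, _interrupt, one_min, five_min = map(int, match.groups())
    return {"five_sec": five_sec, "one_min": one_min, "five_min": five_min}


def classify_cpu(averages: dict, critical: int = 90, warning: int = 75) -> str:
    """Grade the worst of the three averages.

    critical=90 follows the threshold above; warning=75 is an assumption.
    """
    worst = max(averages.values())
    if worst > critical:
        return "CRITICAL"
    if worst > warning:
        return "WARNING"
    return "HEALTHY"
```

Grading the worst of the three averages is a conservative choice; a report that only cares about sustained load could pass just the 5-minute value instead.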
Top processes to watch:
- IP Input: high traffic volume or routing loops
- BGP Router / BGP I/O: large BGP table or instability
- OSPF-1 Hello: OSPF adjacency issues
- Crypto IKMP / Crypto Engine: IPsec overhead
- SNMP ENGINE: polling storm
- ARP Input: ARP storm or L2 loop

PYATS_TESTBED_PATH=$PYATS_TESTBED_PATH python3 $MCP_CALL "python3 -u $PYATS_MCP_SCRIPT" pyats_run_show_command '{"device_name":"R1","command":"show processes memory sorted"}'
Also run:
PYATS_TESTBED_PATH=$PYATS_TESTBED_PATH python3 $MCP_CALL "python3 -u $PYATS_MCP_SCRIPT" pyats_run_show_command '{"device_name":"R1","command":"show platform resources"}'
Thresholds:
Above 95% → CRITICAL: risk of process crashes or OOM
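A minimal sketch of turning used and total memory into a status plus a detail string in the style of the summary table ("45% used (1.2G/2.6G)"). It assumes the two byte counts have already been extracted from the command output. `memory_status` is a hypothetical helper, and the `warning_pct` default of 85 is an assumption; only the 95% CRITICAL level comes from this document.

```python
from typing import Tuple


def memory_status(
    used_bytes: int,
    total_bytes: int,
    critical_pct: float = 95.0,  # from the threshold above
    warning_pct: float = 85.0,   # assumed placeholder, not from this document
) -> Tuple[str, str]:
    """Return (status, detail) for a used/total memory pair."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    pct = 100.0 * used_bytes / total_bytes
    if pct > critical_pct:
        status = "CRITICAL"
    elif pct > warning_pct:
        status = "WARNING"
    else:
        status = "HEALTHY"
    gib = 2 ** 30
    detail = f"{pct:.0f}% used ({used_bytes / gib:.1f}G/{total_bytes / gib:.1f}G)"
    return status, detail
```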
Memory consumers to watch:
- BGP Router: large BGP table (full internet table = ~1M routes)
- CEF process: large FIB
- OSPF Router: large OSPF LSDB
- HTTP CORE: web server / RESTCONF overhead
- IOSD iomem: I/O memory for packet buffers

PYATS_TESTBED_PATH=$PYATS_TESTBED_PATH python3 $MCP_CALL "python3 -u $PYATS_MCP_SCRIPT" pyats_run_show_command '{"device_name":"R1","command":"show ip interface brief"}'
Then for each active interface, get detailed counters:
PYATS_TESTBED_PATH=$PYATS_TESTBED_PATH python3 $MCP_CALL "python3 -u $PYATS_MCP_SCRIPT" pyats_run_show_command '{"device_name":"R1","command":"show interfaces"}'
Report for each interface:
Flags:
PYATS_TESTBED_PATH=$PYATS_TESTBED_PATH python3 $MCP_CALL "python3 -u $PYATS_MCP_SCRIPT" pyats_run_show_command '{"device_name":"R1","command":"show inventory"}'
PYATS_TESTBED_PATH=$PYATS_TESTBED_PATH python3 $MCP_CALL "python3 -u $PYATS_MCP_SCRIPT" pyats_run_show_command '{"device_name":"R1","command":"show platform"}'
Report: Module status (ok/fail), serial numbers, PID, transceiver types and DOM readings.
PYATS_TESTBED_PATH=$PYATS_TESTBED_PATH python3 $MCP_CALL "python3 -u $PYATS_MCP_SCRIPT" pyats_run_show_command '{"device_name":"R1","command":"show ntp associations"}'
PYATS_TESTBED_PATH=$PYATS_TESTBED_PATH python3 $MCP_CALL "python3 -u $PYATS_MCP_SCRIPT" pyats_run_show_command '{"device_name":"R1","command":"show clock"}'
Flags:
No synchronized peer (no * in associations) → CRITICAL for logging/forensics

PYATS_TESTBED_PATH=$PYATS_TESTBED_PATH python3 $MCP_CALL "python3 -u $PYATS_MCP_SCRIPT" pyats_show_logging '{"device_name":"R1"}'
Scan for these patterns:
- %SYS-*-RELOAD: reload events
- %LINEPROTO-5-UPDOWN: interface flaps
- %OSPF-*-ADJCHG: OSPF adjacency changes
- %BGP-*-ADJCHANGE: BGP peer state changes
- %DUAL-*-NBRCHANGE: EIGRP neighbor changes
- %SYS-2-MALLOCFAIL: memory allocation failure (CRITICAL)
- %SYS-3-CPUHOG: process monopolizing CPU (HIGH)
- %TRACKING-*: IP SLA or object tracking changes
- %SEC-* / %AUTHMGR-*: security events
- %PLATFORM-*-CRASH: crash events (CRITICAL)
- Traceback: software bug (CRITICAL, open TAC case)

Test reachability to critical infrastructure:
PYATS_TESTBED_PATH=$PYATS_TESTBED_PATH python3 $MCP_CALL "python3 -u $PYATS_MCP_SCRIPT" pyats_ping_from_network_device '{"device_name":"R1","command":"ping 8.8.8.8 repeat 5"}'
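To feed the connectivity row of the report, the success rate and average round-trip time can be read from the final line of the ping output. The sketch below assumes the usual IOS/IOS-XE wording ("Success rate is 100 percent (5/5), round-trip min/avg/max = 20/23/30 ms"); `parse_ping` is a hypothetical helper, not part of the MCP server.

```python
import re

# The round-trip part is absent when every probe is lost, so it is optional.
PING_RE = re.compile(
    r"Success rate is (\d+) percent \((\d+)/(\d+)\)"
    r"(?:, round-trip min/avg/max = (\d+)/(\d+)/(\d+) ms)?"
)


def parse_ping(output: str) -> dict:
    """Extract success percentage, probe counts and average RTT from ping output."""
    match = PING_RE.search(output)
    if match is None:
        raise ValueError("ping summary line not found")
    avg = match.group(5)
    return {
        "success_pct": int(match.group(1)),
        "received": int(match.group(2)),
        "sent": int(match.group(3)),
        "avg_ms": int(avg) if avg is not None else None,
    }
```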
Thresholds:
Always produce a summary table:
Device: R1 (devnetsandboxiosxec8k.cisco.com)
Model: C8000V | IOS-XE: 17.x.x | Uptime: XXd XXh
┌──────────────────┬──────────┬─────────────────────────┐
│ Check │ Status │ Details │
├──────────────────┼──────────┼─────────────────────────┤
│ CPU (5min avg) │ HEALTHY │ 12% │
│ Memory │ HEALTHY │ 45% used (1.2G/2.6G) │
│ Interfaces │ WARNING │ Gi2 down/down │
│ Hardware │ HEALTHY │ All modules OK │
│ NTP │ HEALTHY │ Synced, offset 2ms │
│ Logs │ WARNING │ 3 OSPF adjacency flaps │
│ Connectivity │ HEALTHY │ 100% to 8.8.8.8, 23ms │
└──────────────────┴──────────┴─────────────────────────┘
Overall: WARNING — 2 items need attention
Severity order: CRITICAL > HIGH > WARNING > HEALTHY. Overall status = worst individual status.
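The "worst individual status" rule can be sketched as a lookup against the severity order. The names `overall_status` and `needs_attention` are illustrative, and treating an empty set of checks as HEALTHY is an assumption.

```python
# Least to most severe, per the severity order stated above.
SEVERITY_ORDER = ["HEALTHY", "WARNING", "HIGH", "CRITICAL"]


def overall_status(checks: dict) -> str:
    """Return the worst status among the per-check results."""
    return max(checks.values(), key=SEVERITY_ORDER.index, default="HEALTHY")


def needs_attention(checks: dict) -> list:
    """Names of the checks whose status is anything other than HEALTHY."""
    return [name for name, status in checks.items() if status != "HEALTHY"]
```

Applied to the sample report above, this gives WARNING with two items flagged (Interfaces and Logs).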
When NetBox is available ($NETBOX_MCP_SCRIPT is set), cross-reference device state against the source of truth after Steps 1 and 4:
Query NetBox for expected interface states:
python3 $MCP_CALL "python3 -u $NETBOX_MCP_SCRIPT" netbox_get_objects '{"object_type":"dcim.interfaces","filters":{"device":"R1"},"brief":true}'
Compare NetBox intent vs device reality:
Query NetBox for expected IP assignments:
python3 $MCP_CALL "python3 -u $NETBOX_MCP_SCRIPT" netbox_get_objects '{"object_type":"ipam.ip-addresses","filters":{"device":"R1"}}'
Compare: Flag any IP_DRIFT where the device IP differs from NetBox.
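One way the IP comparison might look, assuming both sides have already been reduced to a mapping of interface name to address in CIDR form. That input shape and the `find_ip_drift` helper are assumptions for illustration; neither tool returns data in exactly this form.

```python
import ipaddress


def _normalize(address):
    """Parse 'a.b.c.d/len' into an interface object; pass None through."""
    return ipaddress.ip_interface(address) if address else None


def find_ip_drift(device_ips: dict, netbox_ips: dict) -> list:
    """Return one IP_DRIFT record per interface whose address differs.

    An interface present on only one side is reported with None on the other.
    """
    drift = []
    for name in sorted(set(device_ips) | set(netbox_ips)):
        actual = device_ips.get(name)
        expected = netbox_ips.get(name)
        if _normalize(actual) != _normalize(expected):
            drift.append(
                {"interface": name, "expected": expected,
                 "actual": actual, "flag": "IP_DRIFT"}
            )
    return drift
```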
To run health checks across ALL devices simultaneously, first list all devices:
PYATS_TESTBED_PATH=$PYATS_TESTBED_PATH python3 $MCP_CALL "python3 -u $PYATS_MCP_SCRIPT" pyats_list_devices
Then run Steps 1-8 on each device concurrently using multiple exec commands. Collect all results and produce a fleet summary:
┌──────────┬──────────┬──────┬────────┬──────────┬─────────────┐
│ Device │ CPU │ Mem │ Intf │ NTP │ Overall │
├──────────┼──────────┼──────┼────────┼──────────┼─────────────┤
│ R1 │ HEALTHY │ WARN │ HEALTHY│ HEALTHY │ WARNING │
│ R2 │ HEALTHY │ OK │ CRIT │ HEALTHY │ CRITICAL │
│ SW1 │ HIGH │ OK │ HEALTHY│ CRIT │ CRITICAL │
└──────────┴──────────┴──────┴────────┴──────────┴─────────────┘
Sort devices by severity (CRITICAL first) for triage prioritization.
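A sketch of the fleet run and the severity sort, under the assumption that a per-device routine already exists. `check_device` is a hypothetical callable that runs Steps 1-8 for one device name and returns a dict with at least `device` and `overall` keys; a thread pool is just one way to run the checks concurrently.

```python
from concurrent.futures import ThreadPoolExecutor

# Lower rank sorts first, so CRITICAL devices lead the table.
SEVERITY_RANK = {"CRITICAL": 0, "HIGH": 1, "WARNING": 2, "HEALTHY": 3}


def run_fleet(devices, check_device, max_workers: int = 8) -> list:
    """Run check_device for every device concurrently and sort by severity."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(check_device, devices))
    # Ties within a severity level are broken by device name for stable output.
    return sorted(results, key=lambda r: (SEVERITY_RANK[r["overall"]], r["device"]))
```

Threads suit this workload because each check spends most of its time waiting on device I/O rather than on local computation.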
After completing a health check, record the session in GAIT:
python3 $MCP_CALL "python3 -u $GAIT_MCP_SCRIPT" gait_record_turn '{"input":{"role":"assistant","content":"Health check completed on R1: CPU HEALTHY (12%), Memory WARNING (78%), Interfaces HEALTHY, NTP HEALTHY. Overall: WARNING.","artifacts":[]}}'
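When the summary text is generated rather than typed by hand, building the JSON argument with `json.dumps` avoids shell-quoting problems. The sketch below mirrors the payload shape of the command above; `gait_record_args` is a hypothetical helper, and the two path parameters stand in for the values of `$MCP_CALL` and `$GAIT_MCP_SCRIPT`.

```python
import json


def gait_record_args(summary: str, mcp_call: str, gait_script: str) -> list:
    """Build the argument list for the gait_record_turn call shown above."""
    payload = {
        "input": {"role": "assistant", "content": summary, "artifacts": []}
    }
    return [
        "python3",
        mcp_call,
        f"python3 -u {gait_script}",
        "gait_record_turn",
        json.dumps(payload),
    ]
```

The returned list can be passed to `subprocess.run` without `shell=True`, so quotes inside the summary need no extra escaping.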