workspace/skills/pyats-troubleshoot/SKILL.md
Systematic network troubleshooting - connectivity, routing, interface, protocol, and performance issues using structured OSI-layer and divide-and-conquer methodology. Use when something is broken, a device is unreachable, a link is flapping, users report slow performance, or an OSPF/BGP adjacency is down.
PYATS_TESTBED_PATH=$PYATS_TESTBED_PATH python3 $MCP_CALL "python3 -u $PYATS_MCP_SCRIPT" pyats_run_show_command '{"device_name":"R1","command":"show interfaces"}'
Check:
no shutdown needed
PYATS_TESTBED_PATH=$PYATS_TESTBED_PATH python3 $MCP_CALL "python3 -u $PYATS_MCP_SCRIPT" pyats_run_show_command '{"device_name":"R1","command":"show arp"}'
Check:
Incomplete ARP entries → destination not responding on the segment
# Check local interface has correct IP
PYATS_TESTBED_PATH=$PYATS_TESTBED_PATH python3 $MCP_CALL "python3 -u $PYATS_MCP_SCRIPT" pyats_run_show_command '{"device_name":"R1","command":"show ip interface brief"}'
# Check routing table for destination
PYATS_TESTBED_PATH=$PYATS_TESTBED_PATH python3 $MCP_CALL "python3 -u $PYATS_MCP_SCRIPT" pyats_run_show_command '{"device_name":"R1","command":"show ip route"}'
# Ping the destination
PYATS_TESTBED_PATH=$PYATS_TESTBED_PATH python3 $MCP_CALL "python3 -u $PYATS_MCP_SCRIPT" pyats_ping_from_network_device '{"device_name":"R1","command":"ping 10.0.0.1"}'
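Every call above follows the same shape: the MCP caller, the server command as one quoted argument, the tool name, then a JSON argument string. A minimal sketch of assembling that argument list in Python, so the JSON is generated rather than hand-typed. The helper name `build_mcp_command` and the placeholder script paths are assumptions for illustration, not part of the skill.

```python
import json
import os
import shlex


def build_mcp_command(tool, arguments, env=None):
    """Return the argv list for one MCP tool call, mirroring the shell form used above."""
    env = os.environ if env is None else env
    server = f"python3 -u {env['PYATS_MCP_SCRIPT']}"  # passed as a single argument
    payload = json.dumps(arguments, separators=(",", ":"))
    return ["python3", env["MCP_CALL"], server, tool, payload]


argv = build_mcp_command(
    "pyats_run_show_command",
    {"device_name": "R1", "command": "show ip route"},
    # Placeholder paths, used only so the example runs without a real setup.
    env={"MCP_CALL": "mcp-call.py", "PYATS_MCP_SCRIPT": "pyats_mcp_server.py"},
)
print(shlex.join(argv))  # shell-quoted form, for comparison with the commands above
```

The list could be handed to `subprocess.run(argv, env=...)` with `PYATS_TESTBED_PATH` present in the environment, which avoids shell quoting altogether.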
L3 troubleshooting decision tree:
show ip route <destination>
Advanced ping options:
# Ping with specific source
PYATS_TESTBED_PATH=$PYATS_TESTBED_PATH python3 $MCP_CALL "python3 -u $PYATS_MCP_SCRIPT" pyats_ping_from_network_device '{"device_name":"R1","command":"ping 10.0.0.1 source Loopback0"}'
# Ping with larger packet size (test MTU)
PYATS_TESTBED_PATH=$PYATS_TESTBED_PATH python3 $MCP_CALL "python3 -u $PYATS_MCP_SCRIPT" pyats_ping_from_network_device '{"device_name":"R1","command":"ping 10.0.0.1 size 1500 df-bit"}'
# Extended ping with repeat count
PYATS_TESTBED_PATH=$PYATS_TESTBED_PATH python3 $MCP_CALL "python3 -u $PYATS_MCP_SCRIPT" pyats_ping_from_network_device '{"device_name":"R1","command":"ping 10.0.0.1 repeat 100 source Loopback0"}'
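The `size ... df-bit` ping can be repeated at different sizes to narrow down the largest packet that crosses the path unfragmented. A minimal sketch of that search, assuming a hypothetical `ping_ok(size)` callable that sends one such ping and returns whether it succeeded. Wiring it to `pyats_ping_from_network_device` and parsing the result is not shown, and the default lower bound of 576 is an arbitrary assumption.

```python
def largest_unfragmented_size(ping_ok, low=576, high=1500):
    """Binary-search the largest size in [low, high] for which ping_ok(size) is True.

    Returns None if even the smallest size fails, which points away from MTU
    and toward a plain reachability problem.
    """
    if not ping_ok(low):
        return None
    while low < high:
        mid = (low + high + 1) // 2  # round up so the loop always makes progress
        if ping_ok(mid):
            low = mid
        else:
            high = mid - 1
    return low


# Simulated path that drops anything larger than 1400 bytes when DF is set.
simulated = lambda size: size <= 1400
print(largest_unfragmented_size(simulated))
```

Each probe is one ping, so the search needs around ten pings for the default range instead of hundreds.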
# Check ACLs that might be blocking traffic
PYATS_TESTBED_PATH=$PYATS_TESTBED_PATH python3 $MCP_CALL "python3 -u $PYATS_MCP_SCRIPT" pyats_run_show_command '{"device_name":"R1","command":"show ip access-lists"}'
# Check NAT translations
PYATS_TESTBED_PATH=$PYATS_TESTBED_PATH python3 $MCP_CALL "python3 -u $PYATS_MCP_SCRIPT" pyats_run_show_command '{"device_name":"R1","command":"show ip nat translations"}'
ACL troubleshooting:
deny any at the end of every ACL
PYATS_TESTBED_PATH=$PYATS_TESTBED_PATH python3 $MCP_CALL "python3 -u $PYATS_MCP_SCRIPT" pyats_run_show_command '{"device_name":"R1","command":"show ip ospf neighbor"}'
PYATS_TESTBED_PATH=$PYATS_TESTBED_PATH python3 $MCP_CALL "python3 -u $PYATS_MCP_SCRIPT" pyats_run_show_command '{"device_name":"R1","command":"show ip ospf interface"}'
OSPF adjacency troubleshooting checklist:
PYATS_TESTBED_PATH=$PYATS_TESTBED_PATH python3 $MCP_CALL "python3 -u $PYATS_MCP_SCRIPT" pyats_run_show_command '{"device_name":"R1","command":"show ip bgp summary"}'
PYATS_TESTBED_PATH=$PYATS_TESTBED_PATH python3 $MCP_CALL "python3 -u $PYATS_MCP_SCRIPT" pyats_run_show_command '{"device_name":"R1","command":"show ip bgp neighbors"}'
BGP adjacency troubleshooting checklist:
- update-source configured correctly? (iBGP typically uses Loopback)
- ebgp-multihop needed? (if eBGP peer is not directly connected)
- neighbor X activate present under the correct address-family?
- (neighbor X shutdown)
- show ip bgp neighbors for error codes

BGP NOTIFICATION error codes:

| Code | Meaning |
|------|---------|
| 1 - Message Header Error | Malformed packet |
| 2 - OPEN Message Error | Capability mismatch, bad AS, bad hold time |
| 3 - UPDATE Message Error | Malformed UPDATE, invalid path attribute |
| 4 - Hold Timer Expired | Peer stopped sending KEEPALIVEs |
| 5 - FSM Error | Unexpected state transition |
| 6 - Cease | Administrative shutdown, max-prefix exceeded, peer deconfigured |
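The error-code table can be kept as a small lookup so a code seen in `show ip bgp neighbors` output is translated consistently. A minimal sketch: the dictionary holds only the six codes listed above, and the function name `describe_notification` is hypothetical.

```python
# Code -> (name, typical meaning), taken from the table above.
BGP_NOTIFICATION_CODES = {
    1: ("Message Header Error", "Malformed packet"),
    2: ("OPEN Message Error", "Capability mismatch, bad AS, bad hold time"),
    3: ("UPDATE Message Error", "Malformed UPDATE, invalid path attribute"),
    4: ("Hold Timer Expired", "Peer stopped sending KEEPALIVEs"),
    5: ("FSM Error", "Unexpected state transition"),
    6: ("Cease", "Administrative shutdown, max-prefix exceeded, peer deconfigured"),
}


def describe_notification(code):
    """Return a one-line description for a NOTIFICATION error code."""
    if code not in BGP_NOTIFICATION_CODES:
        return f"{code} - not listed in the table above"
    name, meaning = BGP_NOTIFICATION_CODES[code]
    return f"{code} - {name}: {meaning}"


print(describe_notification(4))
```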
PYATS_TESTBED_PATH=$PYATS_TESTBED_PATH python3 $MCP_CALL "python3 -u $PYATS_MCP_SCRIPT" pyats_run_show_command '{"device_name":"R1","command":"show processes cpu sorted"}'
PYATS_TESTBED_PATH=$PYATS_TESTBED_PATH python3 $MCP_CALL "python3 -u $PYATS_MCP_SCRIPT" pyats_run_show_command '{"device_name":"R1","command":"show processes memory sorted"}'
PYATS_TESTBED_PATH=$PYATS_TESTBED_PATH python3 $MCP_CALL "python3 -u $PYATS_MCP_SCRIPT" pyats_run_show_command '{"device_name":"R1","command":"show interfaces"}'
Look for:
PYATS_TESTBED_PATH=$PYATS_TESTBED_PATH python3 $MCP_CALL "python3 -u $PYATS_MCP_SCRIPT" pyats_run_show_command '{"device_name":"R1","command":"show policy-map interface"}'
Check: Class drops, queue depths, policing rates.
Is traffic taking the expected path?
PYATS_TESTBED_PATH=$PYATS_TESTBED_PATH python3 $MCP_CALL "python3 -u $PYATS_MCP_SCRIPT" pyats_run_show_command '{"device_name":"R1","command":"show ip route 10.0.0.1"}'
Is traffic taking a suboptimal path through a slower link? Check metrics, AD values, and path selection.
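When reasoning about why one path was chosen, it can help to model the comparison explicitly: the longest matching prefix wins, then the lowest administrative distance, then the lowest metric. A simplified sketch under those assumptions. The route dictionaries and their field names are hypothetical, and real platforms only compare metrics within one protocol, so this is a reasoning aid rather than a model of any specific RIB.

```python
import ipaddress


def best_route(destination, routes):
    """Pick the preferred route for `destination` from candidate route dicts.

    Preference: longest matching prefix, then lowest AD, then lowest metric.
    Returns None when no candidate covers the destination.
    """
    dest = ipaddress.ip_address(destination)
    candidates = [r for r in routes if dest in ipaddress.ip_network(r["prefix"])]
    if not candidates:
        return None
    return min(
        candidates,
        key=lambda r: (-ipaddress.ip_network(r["prefix"]).prefixlen, r["ad"], r["metric"]),
    )


# Illustrative candidates only: a static default, an OSPF route, an EIGRP route.
routes = [
    {"prefix": "0.0.0.0/0", "ad": 1, "metric": 0, "next_hop": "192.0.2.1"},
    {"prefix": "10.0.0.0/24", "ad": 110, "metric": 20, "next_hop": "192.0.2.2"},
    {"prefix": "10.0.0.0/24", "ad": 90, "metric": 3072, "next_hop": "192.0.2.3"},
]
print(best_route("10.0.0.1", routes)["next_hop"])
```

Note that the default route loses here despite its lower AD, because prefix length is compared first.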
Symptoms: incrementing TTL-exceeded counters, packets bouncing between two routers.
# Check for TTL exceeded ICMP messages
PYATS_TESTBED_PATH=$PYATS_TESTBED_PATH python3 $MCP_CALL "python3 -u $PYATS_MCP_SCRIPT" pyats_show_logging '{"device_name":"R1"}'
Trace the route: check the next-hop for the destination on each router in the path. If router A points to B and B points back to A → routing loop.
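The hop-by-hop check described above amounts to following next-hop pointers until a router repeats. A minimal sketch, assuming the next hop for the destination has already been read from each router's routing table and resolved to a router name. The `next_hops` mapping and the function name are hypothetical.

```python
def find_routing_loop(next_hops, start):
    """Follow next-hop pointers from `start` toward the destination.

    `next_hops` maps router name -> next-hop router name for one destination.
    A router missing from the mapping ends the walk (for example, the
    destination is directly connected there). Returns the list of routers
    forming the loop, or None if the walk ends without repeating a router.
    """
    path = []
    position = {}  # router -> index in path where it was first visited
    router = start
    while router in next_hops:
        if router in position:
            return path[position[router]:]
        position[router] = len(path)
        path.append(router)
        router = next_hops[router]
    return None


# R2 points to R3 and R3 points back to R2.
print(find_routing_loop({"R1": "R2", "R2": "R3", "R3": "R2"}, "R1"))
```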
PYATS_TESTBED_PATH=$PYATS_TESTBED_PATH python3 $MCP_CALL "python3 -u $PYATS_MCP_SCRIPT" pyats_show_logging '{"device_name":"R1"}'
PYATS_TESTBED_PATH=$PYATS_TESTBED_PATH python3 $MCP_CALL "python3 -u $PYATS_MCP_SCRIPT" pyats_run_show_command '{"device_name":"R1","command":"show interfaces"}'
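The logging output collected above can be scanned for link state messages to see which interface is changing state most often. A minimal sketch, assuming the usual IOS wording of `%LINK-3-UPDOWN` and `%LINEPROTO-5-UPDOWN` messages. The sample log lines are made up for illustration, and the function name is hypothetical.

```python
import re
from collections import Counter

# Matches both physical link and line protocol state-change messages.
UPDOWN = re.compile(
    r"%(?:LINEPROTO-5-UPDOWN|LINK-3-UPDOWN): .*?Interface (\S+), changed state to (up|down)"
)


def count_transitions(log_text):
    """Count state-change messages per interface in raw logging output."""
    counts = Counter()
    for line in log_text.splitlines():
        match = UPDOWN.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Illustrative log excerpt, not captured from a real device.
sample = """\
*Mar  1 00:01:10.123: %LINK-3-UPDOWN: Interface GigabitEthernet2, changed state to down
*Mar  1 00:01:11.123: %LINEPROTO-5-UPDOWN: Line protocol on Interface GigabitEthernet2, changed state to down
*Mar  1 00:01:40.456: %LINK-3-UPDOWN: Interface GigabitEthernet2, changed state to up
*Mar  1 00:05:02.789: %LINEPROTO-5-UPDOWN: Line protocol on Interface GigabitEthernet3, changed state to up
"""
for interface, count in count_transitions(sample).most_common():
    print(interface, count)
```

An interface with many transitions in a short window is the first candidate for the flapping checks that follow.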
Common causes of interface flapping:
Logs to look for:
- %LINEPROTO-5-UPDOWN: interface state transitions with timestamps
- %LINK-3-UPDOWN: physical link state changes

When NetBox is available ($NETBOX_MCP_SCRIPT is set), query the source of truth during investigation to validate expected state vs reality:
python3 $MCP_CALL "python3 -u $NETBOX_MCP_SCRIPT" netbox_get_objects '{"object_type":"dcim.interfaces","filters":{"device":"R1"},"brief":true}'
Use during troubleshooting:
python3 $MCP_CALL "python3 -u $NETBOX_MCP_SCRIPT" netbox_get_objects '{"object_type":"dcim.cables","filters":{"device":"R1"}}'
Compare: If CDP/LLDP shows a different neighbor than NetBox documents, the physical topology may have changed without being updated — flag for investigation.
python3 $MCP_CALL "python3 -u $NETBOX_MCP_SCRIPT" netbox_get_objects '{"object_type":"ipam.ip-addresses","filters":{"device":"R1"}}'
Compare: Flag IP_DRIFT if device IP differs from NetBox. This is often the root cause of "can't reach X" tickets when someone changed an IP without updating the source of truth.
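The IP comparison can be expressed as a diff between two per-interface mappings. A minimal sketch, assuming both the device output and the NetBox response have already been reduced to `{interface: address}` dictionaries. That parsing step, the finding fields, and the function name are assumptions for illustration.

```python
def find_ip_drift(device_ips, netbox_ips):
    """Return one IP_DRIFT finding per interface whose address differs.

    An interface present on only one side is reported with None for the
    side where it is missing.
    """
    findings = []
    for interface in sorted(set(device_ips) | set(netbox_ips)):
        actual = device_ips.get(interface)
        documented = netbox_ips.get(interface)
        if actual != documented:
            findings.append(
                {"flag": "IP_DRIFT", "interface": interface,
                 "device": actual, "netbox": documented}
            )
    return findings


device = {"Gi1": "10.1.1.1/24", "Gi2": "10.1.2.1/24", "Lo0": "1.1.1.1/32"}
netbox = {"Gi1": "10.1.1.1/24", "Gi2": "10.1.20.1/24", "Lo0": "1.1.1.1/32"}
for finding in find_ip_drift(device, netbox):
    print(finding)
```

The same shape of comparison would apply to CDP/LLDP neighbors versus documented cables.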
When troubleshooting spans multiple devices (e.g., connectivity between R1 and R4 traversing R2 and R3), collect state from ALL suspect hops simultaneously rather than one at a time:
First, list all devices to identify the path:
PYATS_TESTBED_PATH=$PYATS_TESTBED_PATH python3 $MCP_CALL "python3 -u $PYATS_MCP_SCRIPT" pyats_list_devices
Then run the same show commands on all hops concurrently. For example, for a connectivity loss between R1 and R4, run these commands on R1, R2, R3, and R4 at the same time:
- show ip interface brief - interface state on every hop
- show ip route <destination> - does each hop have a route?
- show ip arp - is next-hop reachable at L2?
- show ip ospf neighbor or show ip bgp summary - adjacency state

Benefit: Instead of spending 4 sequential rounds (one per device), you get the complete picture in a single parallel pass. This lets you immediately identify where in the path the failure occurs.
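One way to picture the parallel pass is a thread pool with one worker per device, each running the command list in order. A minimal sketch, assuming a hypothetical `run_show(device, command)` callable that wraps the `pyats_run_show_command` call and returns its output. Here it is replaced by a stub so the example runs on its own, and the destination address in the command list is illustrative.

```python
from concurrent.futures import ThreadPoolExecutor

COMMANDS = [
    "show ip interface brief",
    "show ip route 10.4.0.1",  # illustrative destination
    "show ip arp",
    "show ip ospf neighbor",
]


def collect_parallel(devices, commands, run_show):
    """Collect {device: {command: output}} with one worker thread per device.

    Commands for a single device run sequentially, so each device only ever
    sees one request at a time.
    """
    def collect_one(device):
        return {command: run_show(device, command) for command in commands}

    with ThreadPoolExecutor(max_workers=max(1, len(devices))) as pool:
        return dict(zip(devices, pool.map(collect_one, devices)))


# Stub runner standing in for the real MCP call.
stub = lambda device, command: f"{device}: output of {command}"
state = collect_parallel(["R1", "R2", "R3", "R4"], COMMANDS, stub)
print(state["R3"]["show ip arp"])
```

Keeping one worker per device is a conservative choice: it parallelizes across hops without assuming the server tolerates concurrent commands on the same device.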
When an OSPF or BGP adjacency is down, always check BOTH ends simultaneously:
# Run on BOTH peers at the same time
PYATS_TESTBED_PATH=$PYATS_TESTBED_PATH python3 $MCP_CALL "python3 -u $PYATS_MCP_SCRIPT" pyats_run_show_command '{"device_name":"R1","command":"show ip ospf neighbor"}'
PYATS_TESTBED_PATH=$PYATS_TESTBED_PATH python3 $MCP_CALL "python3 -u $PYATS_MCP_SCRIPT" pyats_run_show_command '{"device_name":"R2","command":"show ip ospf neighbor"}'
Compare: timer mismatches, area mismatches, authentication failures, and MTU issues require data from both ends to diagnose.
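Once both ends are collected, the comparison reduces to checking that a handful of parameters agree. A minimal sketch, assuming the relevant values have already been extracted from each router's OSPF interface output into dictionaries. The field names and sample values are hypothetical.

```python
# Parameters that must agree on both ends of the link for this check.
MUST_MATCH = ("area", "hello_interval", "dead_interval", "auth_type", "mtu")


def ospf_mismatches(local, remote, keys=MUST_MATCH):
    """Return {parameter: (local_value, remote_value)} for every disagreement."""
    return {
        key: (local.get(key), remote.get(key))
        for key in keys
        if local.get(key) != remote.get(key)
    }


r1 = {"area": "0", "hello_interval": 10, "dead_interval": 40, "auth_type": "none", "mtu": 1500}
r2 = {"area": "0", "hello_interval": 30, "dead_interval": 120, "auth_type": "none", "mtu": 1500}
print(ospf_mismatches(r1, r2))
```

An empty result means the listed parameters agree, so the investigation can move on to other causes.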
After collecting parallel state, sort findings by severity for triage:
┌──────────┬────────────────────────┬──────────┐
│ Device │ Finding │ Severity │
├──────────┼────────────────────────┼──────────┤
│ R2 │ No route to 10.4.0.0/24│ CRITICAL │
│ R3 │ Gi2 down/down │ CRITICAL │
│ R1 │ ARP incomplete for NH │ HIGH │
│ R4 │ All interfaces up │ HEALTHY │
└──────────┴────────────────────────┴──────────┘
Root cause: R3 Gi2 is down → R2 lost its route via R3 → R1 can't ARP for an unreachable next-hop.
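The triage ordering shown in the table can be produced with a stable sort on a severity rank. A minimal sketch using only the three severity labels that appear above. Any additional levels would need their own rank, and the tuple layout is an assumption.

```python
# Lower rank sorts first; only the labels used in the table are defined.
SEVERITY_ORDER = {"CRITICAL": 0, "HIGH": 1, "HEALTHY": 2}

findings = [
    ("R4", "All interfaces up", "HEALTHY"),
    ("R1", "ARP incomplete for NH", "HIGH"),
    ("R2", "No route to 10.4.0.0/24", "CRITICAL"),
    ("R3", "Gi2 down/down", "CRITICAL"),
]

# sorted() is stable, so findings with equal severity keep their input order.
triaged = sorted(findings, key=lambda finding: SEVERITY_ORDER[finding[2]])

for device, finding, severity in triaged:
    print(f"{device:<4} {finding:<26} {severity}")
```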
After completing a troubleshooting session, record findings and resolution in GAIT:
python3 $MCP_CALL "python3 -u $GAIT_MCP_SCRIPT" gait_record_turn '{"input":{"role":"assistant","content":"Troubleshooting: Connectivity loss R1→R4. Root cause: R3 Gi2 down/down (cable fault). Resolution: Escalated to field team for cable replacement. Verified routing reconverged via alternate path R1→R2→R5→R4.","artifacts":[]}}'
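Free-text findings often contain quotes or apostrophes that break a hand-written single-quoted JSON argument. A minimal sketch of generating the `gait_record_turn` argument with the same structure as the command above and quoting it for the shell. The helper name `gait_turn_payload` is hypothetical.

```python
import json
import shlex


def gait_turn_payload(content, artifacts=None):
    """Build the JSON argument for gait_record_turn, matching the shape used above."""
    return json.dumps(
        {"input": {"role": "assistant", "content": content, "artifacts": artifacts or []}},
        ensure_ascii=False,  # keep characters such as arrows readable in the record
    )


payload = gait_turn_payload(
    "Troubleshooting: Connectivity loss R1→R4. Root cause: R3 Gi2 down/down. "
    "R1 can't ARP for the next-hop."
)
# shlex.quote makes the string safe to paste as one shell argument,
# even though the content contains an apostrophe.
print(shlex.quote(payload))
```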
| What to Check | Command |
|---------------|---------|
| Interface status | show ip interface brief |
| Interface details | show interfaces <name> |
| Routing table | show ip route |
| Specific route | show ip route <ip> |
| OSPF neighbors | show ip ospf neighbor |
| BGP summary | show ip bgp summary |
| EIGRP neighbors | show ip eigrp neighbors |
| ARP table | show arp |
| ACLs with hit counts | show ip access-lists |
| NAT translations | show ip nat translations |
| CPU usage | show processes cpu sorted |
| Memory usage | show processes memory sorted |
| System logs | use pyats_show_logging tool |
| Running config | use pyats_show_running_config tool |
| Connectivity test | use pyats_ping_from_network_device tool |