workspace/skills/catc-troubleshoot/SKILL.md
Catalyst Center troubleshooting workflows - device unreachable investigation, client connectivity issues, interface down analysis, site-wide outage triage, wireless roaming problems, integration with pyATS for CLI-level diagnostics. Use when a device is unreachable, a user reports connectivity problems, an interface is down, a site has an outage, or wireless clients have roaming issues.
npx skillsauth add automateyournetwork/netclaw catc-troubleshootInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
All Catalyst Center tool calls use this invocation pattern:
CCC_HOST=$CCC_HOST CCC_USER=$CCC_USER CCC_PWD=$CCC_PWD python3 -u $CATC_MCP_SCRIPT
Variable shorthand used throughout this document:
CATC_CMD="CCC_HOST=$CCC_HOST CCC_USER=$CCC_USER CCC_PWD=$CCC_PWD python3 -u $CATC_MCP_SCRIPT"
Use the $MCP_CALL protocol handler to invoke MCP tools:
CCC_HOST=$CCC_HOST CCC_USER=$CCC_USER CCC_PWD=$CCC_PWD python3 $MCP_CALL "python3 -u $CATC_MCP_SCRIPT" TOOL_NAME 'ARGS_JSON'
Trigger: Catalyst Center shows one or more devices with reachabilityStatus: Unreachable.
Device Unreachable in Catalyst Center
|
+-- Step 1: Confirm the scope
| How many devices are unreachable?
| Are they at the same site?
| |
| +-- Single device unreachable
| | --> Go to "Single Device Investigation"
| |
| +-- Multiple devices at same site
| | --> Go to "Site-Wide Outage Triage" (Issue 4)
| |
| +-- Multiple devices across sites
| --> Go to "Catalyst Center or Upstream Issue"
|
+-- Step 2: Single Device Investigation
| |
| +-- Check device details in CatC
| +-- Check last update time
| +-- Check collection status and error code
| +-- Attempt pyATS connectivity test
| |
| +-- pyATS connects successfully?
| | |
| | +-- YES: CatC polling issue (SNMP/NETCONF credentials, ACL)
| | +-- NO: Device is truly unreachable
| | |
| | +-- Ping from adjacent device (pyATS)
| | +-- Check upstream interface state
| | +-- Check ARP/MAC table on upstream switch
| | +-- Physical layer: power, cable, SFP
CCC_HOST=$CCC_HOST CCC_USER=$CCC_USER CCC_PWD=$CCC_PWD python3 $MCP_CALL "python3 -u $CATC_MCP_SCRIPT" fetch_devices '{"reachabilityStatus":["Unreachable"]}'
Record for each device: hostname, managementIpAddress, platformId, locationName, lastUpdated, collectionStatus, errorCode.
# Get the site hierarchy to understand the topology
CCC_HOST=$CCC_HOST CCC_USER=$CCC_USER CCC_PWD=$CCC_PWD python3 $MCP_CALL "python3 -u $CATC_MCP_SCRIPT" fetch_sites '{}'
Group unreachable devices by locationName. If multiple devices at the same site are unreachable, skip to Issue 4 (Site-Wide Outage).
CCC_HOST=$CCC_HOST CCC_USER=$CCC_USER CCC_PWD=$CCC_PWD python3 $MCP_CALL "python3 -u $CATC_MCP_SCRIPT" fetch_devices '{"hostname":["UNREACHABLE-SW-01"]}'
Analyze:
lastUpdated -- How long has it been unreachable? Minutes = recent event. Hours/days = chronic issue.collectionStatus -- Could Not Synchronize vs Partial Collection Failure vs ManagederrorCode -- Specific error from Catalyst Center's collection engineCCC_HOST=$CCC_HOST CCC_USER=$CCC_USER CCC_PWD=$CCC_PWD python3 $MCP_CALL "python3 -u $CATC_MCP_SCRIPT" fetch_interfaces '{"device_id":"<UUID-from-step-3>"}'
NOTE: This returns the last-known interface state. If the device became unreachable recently, this data may still reflect the pre-failure state. Look for interfaces that were already showing errors before the device went dark.
When $PYATS_MCP_SCRIPT and $PYATS_TESTBED_PATH are available and the device is in the pyATS testbed:
# Attempt to connect and get basic state
PYATS_TESTBED_PATH=$PYATS_TESTBED_PATH python3 $MCP_CALL "python3 -u $PYATS_MCP_SCRIPT" pyats_run_show_command '{"device_name":"UNREACHABLE-SW-01","command":"show ip interface brief"}'
If pyATS connects but CatC cannot reach the device:
If pyATS also cannot connect:
PYATS_TESTBED_PATH=$PYATS_TESTBED_PATH python3 $MCP_CALL "python3 -u $PYATS_MCP_SCRIPT" pyats_ping_from_network_device '{"device_name":"UPSTREAM-SW-01","command":"ping 10.1.10.1"}'
PYATS_TESTBED_PATH=$PYATS_TESTBED_PATH python3 $MCP_CALL "python3 -u $PYATS_MCP_SCRIPT" pyats_run_show_command '{"device_name":"UPSTREAM-SW-01","command":"show interfaces GigabitEthernet1/0/1"}'
PYATS_TESTBED_PATH=$PYATS_TESTBED_PATH python3 $MCP_CALL "python3 -u $PYATS_MCP_SCRIPT" pyats_run_show_command '{"device_name":"UPSTREAM-SW-01","command":"show arp"}'
| Root Cause | CatC Indicators | pyATS Indicators | Resolution |
|------------|-----------------|-------------------|------------|
| Device powered off | Unreachable, no updates | Connection refused | Check power, PDU, UPS |
| Management interface down | Unreachable | Cannot SSH | Verify mgmt interface config via console |
| SNMP credential mismatch | Partial Collection Failure | pyATS connects OK | Fix SNMP community on device or CatC |
| ACL blocking CatC | Unreachable from CatC | pyATS connects OK | Update ACL to permit CatC management IPs |
| Routing issue to mgmt subnet | Unreachable | Ping fails from adjacent device | Check routing table, management VRF |
| Upstream link failure | Multiple devices unreachable | Upstream interface down | Physical layer: cable, SFP, port |
| Device crash/reload | Recently unreachable | Boot messages in logs | Check show logging, show version uptime |
Trigger: User reports cannot connect to the network, slow connectivity, or intermittent drops.
Client Connectivity Issue
|
+-- Step 1: Identify the client
| Do we have the MAC address, IP address, or username?
| |
| +-- MAC known --> get_client_details_by_mac
| +-- IP known --> get_clients_list with ipv4_address filter, then get details by MAC
| +-- Only username/location --> get_clients_list filtered by site
|
+-- Step 2: Is the client visible in CatC?
| |
| +-- YES: Client is associating/authenticating
| | |
| | +-- Check health score
| | +-- Wired or wireless?
| | | |
| | | +-- WIRED --> Check connected switch port, VLAN, duplex
| | | +-- WIRELESS --> Check RSSI, SNR, band, AP, SSID
| | |
| | +-- Check IP assignment (DHCP working?)
| | +-- Check connected network device status
| |
| +-- NO: Client not visible
| |
| +-- Is the access switch/AP reachable?
| +-- Is the client physically connected (cable/Wi-Fi enabled)?
| +-- Is 802.1X/MAB failing? (check ISE if available)
| +-- Is the SSID broadcasting?
CCC_HOST=$CCC_HOST CCC_USER=$CCC_USER CCC_PWD=$CCC_PWD python3 $MCP_CALL "python3 -u $CATC_MCP_SCRIPT" get_api_compatible_time_range '{"time_window":"last 4 hours"}'
By MAC address (preferred):
CCC_HOST=$CCC_HOST CCC_USER=$CCC_USER CCC_PWD=$CCC_PWD python3 $MCP_CALL "python3 -u $CATC_MCP_SCRIPT" get_client_details_by_mac '{"client_mac_address":"AA:BB:CC:DD:EE:FF","start_time":1705312800000,"end_time":1705399200000,"view":["Wireless","WirelessHealth"]}'
By IP address:
CCC_HOST=$CCC_HOST CCC_USER=$CCC_USER CCC_PWD=$CCC_PWD python3 $MCP_CALL "python3 -u $CATC_MCP_SCRIPT" get_clients_list '{"start_time":1705312800000,"end_time":1705399200000,"ipv4_address":["10.1.50.42"]}'
For wired clients, check:
connectedNetworkDeviceName -- Which switch is the client on?connectedNetworkDeviceInterfaceName -- Which port?vlanId -- Correct VLAN assignment?ipv4Address -- Valid IP? Or APIPA (169.254.x.x) indicating DHCP failure?healthScore -- Score 0-10. Below 7 = issue.For wireless clients, check:
ssid -- Connected to the correct SSID?band -- 2.4 GHz or 5 GHz?rssi -- Signal strength (see RSSI table below)snr -- Signal-to-noise ratiodataRate -- Current data rate in MbpsconnectedNetworkDeviceName -- Which AP?healthScore -- Client health scoreIf the client's access switch or AP shows issues:
# Is the access device reachable?
CCC_HOST=$CCC_HOST CCC_USER=$CCC_USER CCC_PWD=$CCC_PWD python3 $MCP_CALL "python3 -u $CATC_MCP_SCRIPT" fetch_devices '{"hostname":["ACC-SW-01"]}'
# Get the switch's interfaces to check the client-facing port
CCC_HOST=$CCC_HOST CCC_USER=$CCC_USER CCC_PWD=$CCC_PWD python3 $MCP_CALL "python3 -u $CATC_MCP_SCRIPT" fetch_interfaces '{"device_id":"<switch-UUID>"}'
When CatC data is insufficient (e.g., you need to check port security, 802.1X session state, or MAC address table):
# Check the specific switchport
PYATS_TESTBED_PATH=$PYATS_TESTBED_PATH python3 $MCP_CALL "python3 -u $PYATS_MCP_SCRIPT" pyats_run_show_command '{"device_name":"ACC-SW-01","command":"show interfaces GigabitEthernet1/0/15"}'
# Check authentication sessions (802.1X/MAB)
PYATS_TESTBED_PATH=$PYATS_TESTBED_PATH python3 $MCP_CALL "python3 -u $PYATS_MCP_SCRIPT" pyats_run_show_command '{"device_name":"ACC-SW-01","command":"show authentication sessions"}'
# Check MAC address table for the client
PYATS_TESTBED_PATH=$PYATS_TESTBED_PATH python3 $MCP_CALL "python3 -u $PYATS_MCP_SCRIPT" pyats_run_show_command '{"device_name":"ACC-SW-01","command":"show mac address-table"}'
# Check DHCP snooping bindings
PYATS_TESTBED_PATH=$PYATS_TESTBED_PATH python3 $MCP_CALL "python3 -u $PYATS_MCP_SCRIPT" pyats_run_show_command '{"device_name":"ACC-SW-01","command":"show ip dhcp snooping binding"}'
# Check port-security
PYATS_TESTBED_PATH=$PYATS_TESTBED_PATH python3 $MCP_CALL "python3 -u $PYATS_MCP_SCRIPT" pyats_run_show_command '{"device_name":"ACC-SW-01","command":"show port-security"}'
| Root Cause | CatC Indicators | pyATS/CLI Indicators | Resolution |
|------------|-----------------|----------------------|------------|
| DHCP exhaustion | Client has 169.254.x.x IP | show ip dhcp pool shows 0 free | Expand DHCP scope or reduce lease time |
| Port security violation | Client not visible | err-disabled state on port | Clear port-security, investigate |
| 802.1X failure | Client not visible | Authentication failed in ISE | Check credentials, certificate, RADIUS |
| Wrong VLAN | Client visible but no connectivity | Port in wrong VLAN | Correct VLAN assignment |
| Duplex mismatch | Poor health score (wired) | Half-duplex on one end | Set both ends to auto or match manually |
| Wireless: low RSSI | Low health score, low RSSI | N/A (wireless) | Move closer to AP, add AP coverage |
| Wireless: co-channel | Intermittent drops | N/A (wireless) | Adjust channel plan, reduce AP power |
| AP overloaded | Multiple clients with issues | N/A (wireless) | Load balance, add AP capacity |
| Upstream link failure | Multiple clients affected | Interface down on distribution | Physical layer or routing issue |
Trigger: Catalyst Center shows interfaces in admin up / operationally down state, or CatC alerts on interface status changes.
# Find the device first
CCC_HOST=$CCC_HOST CCC_USER=$CCC_USER CCC_PWD=$CCC_PWD python3 $MCP_CALL "python3 -u $CATC_MCP_SCRIPT" fetch_devices '{"hostname":["DIST-SW-01"]}'
# Fetch interfaces
CCC_HOST=$CCC_HOST CCC_USER=$CCC_USER CCC_PWD=$CCC_PWD python3 $MCP_CALL "python3 -u $CATC_MCP_SCRIPT" fetch_interfaces '{"device_id":"<UUID>"}'
Filter the interface response for:
adminStatus: "UP" AND status: "down" -- Operationally failed interfaces (admin enabled but link is down)| Interface Type | Impact | Urgency | |---------------|--------|---------| | Uplink to distribution/core | CRITICAL: Affects all downstream devices and clients | Immediate | | Inter-switch link (trunk) | HIGH: May break VLAN spanning, STP reconvergence | Immediate | | Access port (user-facing) | MEDIUM: Single user affected | Standard | | Management interface | HIGH: Loses management access to the device | Urgent | | Loopback | HIGH if used for routing (OSPF RID, BGP update-source) | Urgent |
# Detailed interface statistics
PYATS_TESTBED_PATH=$PYATS_TESTBED_PATH python3 $MCP_CALL "python3 -u $PYATS_MCP_SCRIPT" pyats_run_show_command '{"device_name":"DIST-SW-01","command":"show interfaces TenGigabitEthernet1/0/1"}'
# Check for recent log messages about the interface
PYATS_TESTBED_PATH=$PYATS_TESTBED_PATH python3 $MCP_CALL "python3 -u $PYATS_MCP_SCRIPT" pyats_show_logging '{"device_name":"DIST-SW-01"}'
# Check SFP/transceiver status
PYATS_TESTBED_PATH=$PYATS_TESTBED_PATH python3 $MCP_CALL "python3 -u $PYATS_MCP_SCRIPT" pyats_run_show_command '{"device_name":"DIST-SW-01","command":"show interfaces transceiver"}'
# Check EtherChannel status if applicable
PYATS_TESTBED_PATH=$PYATS_TESTBED_PATH python3 $MCP_CALL "python3 -u $PYATS_MCP_SCRIPT" pyats_run_show_command '{"device_name":"DIST-SW-01","command":"show etherchannel summary"}'
# Check spanning-tree for blocked ports
PYATS_TESTBED_PATH=$PYATS_TESTBED_PATH python3 $MCP_CALL "python3 -u $PYATS_MCP_SCRIPT" pyats_run_show_command '{"device_name":"DIST-SW-01","command":"show spanning-tree"}'
| Symptom | CatC Data | CLI Investigation | Likely Cause |
|---------|-----------|-------------------|-------------|
| admin UP, oper DOWN | Status mismatch in CatC | No link pulse on interface | Bad cable, SFP, remote end shut |
| CRC errors incrementing | Error counters in interface data | show interfaces CRC count | Faulty cable, bad SFP, duplex mismatch |
| Err-disabled | Port not visible or shows errors | show interfaces status err-disabled | Port-security, BPDU guard, storm-control |
| Flapping (up/down cycles) | Multiple status changes | show logging UPDOWN messages | Loose cable, auto-negotiation failure |
| STP blocked | Port up but no traffic | show spanning-tree BLK state | STP topology issue, loop detected |
Trigger: Multiple users at a single site report connectivity loss, or Catalyst Center shows multiple devices unreachable at the same location.
Site-Wide Outage
|
+-- Step 1: Quantify the impact
| How many devices unreachable at this site?
| How many clients affected?
| |
| +-- ALL devices unreachable at site
| | --> WAN link failure, upstream router, or site power outage
| |
| +-- SOME devices unreachable
| | --> Distribution layer or IDF/MDF issue
| |
| +-- Devices reachable but clients can't connect
| --> DHCP, VLAN, wireless controller, or authentication issue
|
+-- Step 2: Identify the failure boundary
| Which devices ARE still reachable?
| What is the common upstream for the failed devices?
|
+-- Step 3: Investigate the upstream device
| Check interfaces, routing, logs on the common upstream
|
+-- Step 4: Resolve and verify
Fix the issue, confirm all devices come back, verify client counts recover
# Get all devices at the affected site
CCC_HOST=$CCC_HOST CCC_USER=$CCC_USER CCC_PWD=$CCC_PWD python3 $MCP_CALL "python3 -u $CATC_MCP_SCRIPT" fetch_devices '{"locationName":["Global/USA/NYC/Floor3"]}'
Categorize the results:
# Get time range
CCC_HOST=$CCC_HOST CCC_USER=$CCC_USER CCC_PWD=$CCC_PWD python3 $MCP_CALL "python3 -u $CATC_MCP_SCRIPT" get_api_compatible_time_range '{"time_window":"last 1 hours"}'
# Count clients at the affected site NOW
CCC_HOST=$CCC_HOST CCC_USER=$CCC_USER CCC_PWD=$CCC_PWD python3 $MCP_CALL "python3 -u $CATC_MCP_SCRIPT" get_clients_count '{"start_time":1705312800000,"end_time":1705399200000,"site_hierarchy":["Global/USA/NYC/Floor3"]}'
# Compare with a baseline (e.g., same time window yesterday)
CCC_HOST=$CCC_HOST CCC_USER=$CCC_USER CCC_PWD=$CCC_PWD python3 $MCP_CALL "python3 -u $CATC_MCP_SCRIPT" get_api_compatible_time_range '{"time_window":"yesterday"}'
Impact assessment:
Check the upstream/distribution devices that serve the affected site:
# Check the distribution switch
CCC_HOST=$CCC_HOST CCC_USER=$CCC_USER CCC_PWD=$CCC_PWD python3 $MCP_CALL "python3 -u $CATC_MCP_SCRIPT" fetch_devices '{"hostname":["DIST-SW-NYC"]}'
# Check its interfaces
CCC_HOST=$CCC_HOST CCC_USER=$CCC_USER CCC_PWD=$CCC_PWD python3 $MCP_CALL "python3 -u $CATC_MCP_SCRIPT" fetch_interfaces '{"device_id":"<DIST-UUID>"}'
If the distribution/core device is still reachable:
# Check all interfaces on the distribution switch
PYATS_TESTBED_PATH=$PYATS_TESTBED_PATH python3 $MCP_CALL "python3 -u $PYATS_MCP_SCRIPT" pyats_run_show_command '{"device_name":"DIST-SW-NYC","command":"show ip interface brief"}'
# Check routing table -- are routes to the affected site present?
PYATS_TESTBED_PATH=$PYATS_TESTBED_PATH python3 $MCP_CALL "python3 -u $PYATS_MCP_SCRIPT" pyats_run_show_command '{"device_name":"DIST-SW-NYC","command":"show ip route"}'
# Check OSPF/BGP adjacencies
PYATS_TESTBED_PATH=$PYATS_TESTBED_PATH python3 $MCP_CALL "python3 -u $PYATS_MCP_SCRIPT" pyats_run_show_command '{"device_name":"DIST-SW-NYC","command":"show ip ospf neighbor"}'
# Check for recent events in the logs
PYATS_TESTBED_PATH=$PYATS_TESTBED_PATH python3 $MCP_CALL "python3 -u $PYATS_MCP_SCRIPT" pyats_show_logging '{"device_name":"DIST-SW-NYC"}'
# Ping the unreachable access switches from the distribution
PYATS_TESTBED_PATH=$PYATS_TESTBED_PATH python3 $MCP_CALL "python3 -u $PYATS_MCP_SCRIPT" pyats_ping_from_network_device '{"device_name":"DIST-SW-NYC","command":"ping 10.1.30.1"}'
| Scope | Likely Cause | Key Evidence | Resolution | |-------|-------------|-------------|------------| | All devices + all clients | Site power outage | All devices unreachable, no response to pings | Dispatch facilities team, check UPS/PDU | | All devices + all clients | WAN link failure | Edge router reachable but uplink down | Check ISP, activate backup WAN if available | | One floor/IDF | Distribution switch failure | Dist switch unreachable, access switches behind it down | Power cycle, console access, hardware swap | | One floor/IDF | Trunk link failure | Access switches up but no VLAN connectivity | Check trunk port, SFP, cable between access and dist | | Devices up but no clients | DHCP failure | Clients getting APIPA addresses | Check DHCP server, ip helper-address, DHCP relay | | Devices up, wireless clients down | WLC issue | APs reachable but no SSID broadcast | Check WLC status, AP join status | | Devices up, some clients down | VLAN issue | Specific VLAN clients affected | Check VLAN trunking, SVI status |
Trigger: Wireless users report drops when moving between floors or areas, or CatC shows poor wireless health scores.
# Extended time range to capture roaming events
CCC_HOST=$CCC_HOST CCC_USER=$CCC_USER CCC_PWD=$CCC_PWD python3 $MCP_CALL "python3 -u $CATC_MCP_SCRIPT" get_api_compatible_time_range '{"time_window":"last 8 hours"}'
# Get detailed client info with wireless views
CCC_HOST=$CCC_HOST CCC_USER=$CCC_USER CCC_PWD=$CCC_PWD python3 $MCP_CALL "python3 -u $CATC_MCP_SCRIPT" get_client_details_by_mac '{"client_mac_address":"AA:BB:CC:DD:EE:FF","start_time":1705312800000,"end_time":1705399200000,"view":["Wireless","WirelessHealth"]}'
Look for:
# List all APs at the affected site
CCC_HOST=$CCC_HOST CCC_USER=$CCC_USER CCC_PWD=$CCC_PWD python3 $MCP_CALL "python3 -u $CATC_MCP_SCRIPT" fetch_devices '{"family":["Unified AP"],"locationName":["Global/USA/NYC/Floor3"]}'
# Count wireless clients per AP to find overloaded APs
CCC_HOST=$CCC_HOST CCC_USER=$CCC_USER CCC_PWD=$CCC_PWD python3 $MCP_CALL "python3 -u $CATC_MCP_SCRIPT" get_clients_list '{"start_time":1705312800000,"end_time":1705399200000,"client_type":"wireless","site_hierarchy":["Global/USA/NYC/Floor3"]}'
| Symptom | Likely Cause | Resolution | |---------|-------------|------------| | Client holds onto far AP (sticky client) | 802.11k/v not enabled or client doesn't support it | Enable Optimized Roaming, BSS Transition Management | | Client drops during roam | 802.11r (FT) not enabled, or PMK caching disabled | Enable Fast Transition (FT), enable CCKM/PMK caching | | Client roams to 2.4 GHz | Band steering not aggressive enough | Increase band steering threshold, check client capability | | Dead zone between APs | RF coverage gap | Add AP, increase power (carefully), adjust antenna | | Client bounces between 2 APs | Equal signal from both APs | Reduce power on one AP, adjust cell boundaries |
Trigger: CatC data seems stale, APIs return unexpected errors, or collectionStatus shows failures across many devices.
CCC_HOST=$CCC_HOST CCC_USER=$CCC_USER CCC_PWD=$CCC_PWD python3 $MCP_CALL "python3 -u $CATC_MCP_SCRIPT" fetch_devices '{"collectionStatus":["Partial Collection Failure"]}'
CCC_HOST=$CCC_HOST CCC_USER=$CCC_USER CCC_PWD=$CCC_PWD python3 $MCP_CALL "python3 -u $CATC_MCP_SCRIPT" fetch_devices '{"collectionStatus":["Could Not Synchronize"]}'
If client API calls return error 14013 (start time > 30 days ago):
get_api_compatible_time_range first to ensure the time range is validIf client API calls return error 14006 (data not ready for endTime):
get_clients_list, get_client_details_by_mac, and get_clients_count tools automatically retry with the API-suggested adjusted endTimeAlways produce a structured findings report after completing a troubleshooting session:
Troubleshooting Report
=======================
Catalyst Center: $CCC_HOST
Timestamp: YYYY-MM-DD HH:MM UTC
Ticket/Case: [reference number if applicable]
Problem Statement
-----------------
[What was reported, by whom, when it started]
Impact Assessment
-----------------
Devices affected: X unreachable out of Y total at site
Clients affected: ~Z clients (estimated from client count delta)
Sites affected: [list]
Severity: CRITICAL / HIGH / MEDIUM / LOW
Investigation Timeline
-----------------------
1. [HH:MM] Checked CatC device reachability -- found X devices unreachable
2. [HH:MM] Identified site-localized issue at Global/USA/NYC/Floor3
3. [HH:MM] Checked distribution switch DIST-SW-NYC -- reachable
4. [HH:MM] Found TenGig1/0/1 (uplink to Floor3 IDF) down/down
5. [HH:MM] pyATS logs show %LINK-3-UPDOWN at 14:23 UTC
6. [HH:MM] SFP transceiver showing rx power below threshold
Root Cause
----------
Failing SFP transceiver on DIST-SW-NYC TenGigabitEthernet1/0/1
(uplink to Floor3 access layer IDF)
Resolution
----------
[Steps taken to resolve, or escalation path if unresolved]
Verification
-----------
- All Floor3 access switches returned to Reachable in CatC
- Client count at Floor3 recovered to baseline (335 clients)
- No further interface flaps observed in 30-minute monitoring window
Preventive Measures
--------------------
- Schedule optical monitoring for all uplink SFPs
- Add redundant uplink to Floor3 IDF (single point of failure identified)
After completing any troubleshooting session, record the findings and resolution in GAIT:
python3 $MCP_CALL "python3 -u $GAIT_MCP_SCRIPT" gait_record_turn '{"input":{"role":"assistant","content":"Troubleshooting: Site-wide outage at Global/USA/NYC/Floor3. Root cause: Failing SFP on DIST-SW-NYC Te1/0/1. Impact: 335 clients, 5 access switches unreachable for 47 minutes. Resolution: SFP replaced, services restored. Preventive: Redundant uplink recommended.","artifacts":[]}}'
testing
Human-in-the-loop escalation via HumanRail — route low-confidence agent decisions, pre-destructive operation approvals, and ambiguous incident tickets to real human engineers. Human answers are verified and returned as structured output. Workers are paid via Lightning Network. Use when the agent is uncertain, when a destructive change needs explicit human sign-off beyond a ServiceNow CR, or when an ambiguous ticket requires human triage before automated handling.
testing
Manage EVE-NG node lifecycle. Use when listing nodes, checking runtime state, creating or deleting nodes, starting or stopping nodes or whole labs, verifying node details, or wiping node NVRAM back to factory defaults.
development
Manage EVE-NG labs and platform inventory. Use when listing labs, checking lab metadata, creating or deleting labs, importing or exporting lab archives, checking EVE-NG health or auth, or verifying available node images before build work.
tools
Execute live CLI commands on running EVE-NG nodes over telnet console. Use when running show commands, making live config changes, verifying protocol state, testing connectivity, checking console readiness, or interacting with IOS, Junos, VPCS, EOS, or NX-OS nodes.