skills/itsm/problem-analysis/SKILL.md
Root cause analysis and problem management including known error documentation, workaround management, and permanent fix tracking
npx skillsauth add happy-technologies-llc/happy-servicenow-skills problem-analysis
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
This skill provides a comprehensive framework for Problem Management in ServiceNow, focusing on identifying root causes, documenting known errors, and implementing permanent solutions.
Problem Management Goals:
When to use this skill:
Roles: problem_manager, problem_admin, or itil
Related skills: itsm/incident-lifecycle, itsm/major-incident
Find incidents with similar patterns:
Using MCP:
Tool: SN-Query-Table
Parameters:
  table_name: incident
  query: active=false^stateIN6,7^sys_created_on>=javascript:gs.daysAgoStart(30)
  fields: sys_id,number,short_description,category,cmdb_ci,resolution_notes,close_code
  limit: 100
Using REST API:
GET /api/now/table/incident?sysparm_query=active=false^stateIN6,7^sys_created_on>=javascript:gs.daysAgoStart(30)&sysparm_fields=sys_id,number,short_description,category,cmdb_ci,resolution_notes,close_code&sysparm_limit=100
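The encoded query contains `^`, `>` and `=`, which are not URL-safe, so a client would normally percent-encode each parameter before sending the request. A minimal Python sketch, assuming a placeholder instance URL; `build_incident_query_url` is a hypothetical helper, not part of this skill or of ServiceNow:

```python
from urllib.parse import urlencode, urlsplit, parse_qs

def build_incident_query_url(instance_url, days=30, limit=100):
    """Build the Table API URL for resolved/closed incidents created in the last `days` days."""
    params = {
        "sysparm_query": (
            "active=false^stateIN6,7"
            f"^sys_created_on>=javascript:gs.daysAgoStart({days})"
        ),
        "sysparm_fields": (
            "sys_id,number,short_description,category,"
            "cmdb_ci,resolution_notes,close_code"
        ),
        "sysparm_limit": str(limit),
    }
    # urlencode percent-encodes '^', '>', '=', ',' and ':' inside each value.
    return f"{instance_url.rstrip('/')}/api/now/table/incident?{urlencode(params)}"

# Placeholder instance URL (assumption for illustration only).
url = build_incident_query_url("https://example.service-now.com")

# Decoding the query string recovers the original encoded query unchanged.
decoded = parse_qs(urlsplit(url).query)["sysparm_query"][0]
```

Most HTTP clients perform this encoding themselves when given a parameter dictionary, so the helper is mainly useful when a full URL has to be logged or pasted elsewhere.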
Find repeat incidents by CI:
Tool: SN-Query-Table
Parameters:
  table_name: incident
  query: cmdb_ci=[ci_sys_id]^sys_created_on>=javascript:gs.daysAgoStart(90)
  fields: sys_id,number,short_description,resolution_notes,close_code
  order_by: sys_created_on
Grouping Criteria:
Pattern Analysis Work Notes:
Tool: SN-Add-Work-Notes
Parameters:
  sys_id: [analysis_task_sys_id]
  work_notes: |
    === INCIDENT PATTERN ANALYSIS ===
    Analysis Period: [date range]
    PATTERN IDENTIFIED:
    - Total Related Incidents: [count]
    - Affected CI: [CI name]
    - Common Category: [category]
    - Keyword Pattern: [keywords]
    INCIDENT LIST:
    - INC001234 - [date] - [short description]
    - INC001235 - [date] - [short description]
    - INC001240 - [date] - [short description]
    FREQUENCY:
    - First occurrence: [date]
    - Last occurrence: [date]
    - Average frequency: [X per week/month]
    RECOMMENDATION:
    Create problem record for root cause investigation.
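The FREQUENCY figures in the note above can be derived from the query results rather than counted by hand. A rough Python sketch, assuming each record carries `sys_created_on` in the `YYYY-MM-DD HH:MM:SS` form the Table API returns; `summarize_pattern` and the sample records are illustrative only:

```python
from datetime import datetime

def summarize_pattern(incidents):
    """Return count, first/last occurrence, and average incidents per week."""
    created = sorted(
        datetime.strptime(rec["sys_created_on"], "%Y-%m-%d %H:%M:%S")
        for rec in incidents
    )
    if not created:
        return {"count": 0, "first": None, "last": None, "per_week": 0.0}
    # Treat any span shorter than one week as a full week to avoid inflated rates.
    weeks = max((created[-1] - created[0]).days / 7, 1.0)
    return {
        "count": len(created),
        "first": created[0].date().isoformat(),
        "last": created[-1].date().isoformat(),
        "per_week": round(len(created) / weeks, 2),
    }

# Made-up sample data reusing the placeholder incident numbers from the note.
sample = [
    {"number": "INC001234", "sys_created_on": "2024-01-01 09:15:00"},
    {"number": "INC001235", "sys_created_on": "2024-01-08 10:30:00"},
    {"number": "INC001240", "sys_created_on": "2024-01-29 09:15:00"},
]
summary = summarize_pattern(sample)
```

The one-week floor is a design choice for this sketch: a cluster of incidents within a single day would otherwise report an unrealistically high weekly rate.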
Using MCP:
Tool: SN-Create-Record
Parameters:
  table_name: problem
  data:
    short_description: "[CI/Service] - [Brief description of recurring issue]"
    description: |
      PROBLEM STATEMENT:
      [Clear description of the problem being investigated]
      RELATED INCIDENTS:
      - INC001234 - [date] - [description]
      - INC001235 - [date] - [description]
      - INC001240 - [date] - [description]
      BUSINESS IMPACT:
      - Number of incidents: [count]
      - Total downtime: [hours]
      - Users affected: [count]
      - Business cost: [estimate]
      INITIAL HYPOTHESIS:
      [Initial theory about root cause]
    priority: 2
    impact: 2
    urgency: 2
    category: [category]
    cmdb_ci: [ci_sys_id]
    assignment_group: [team_sys_id]
Using REST API:
POST /api/now/table/problem
Content-Type: application/json
{
  "short_description": "Email Server - Intermittent connection failures",
  "description": "PROBLEM STATEMENT:\nUsers experiencing intermittent email connectivity...",
  "priority": "2",
  "impact": "2",
  "urgency": "2",
  "category": "software",
  "cmdb_ci": "ci_sys_id",
  "assignment_group": "group_sys_id"
}
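Assembling the multi-section description by hand is error-prone once the incident list grows. One possible way to build the request body in Python is sketched below; `build_problem_payload` and its arguments are hypothetical, and the field names simply mirror the JSON example above:

```python
import json

def build_problem_payload(ci_name, brief, statement, incidents, hypothesis,
                          ci_sys_id, group_sys_id, category="software",
                          priority=2, impact=2, urgency=2):
    """Build a problem record body; `incidents` is a list of (number, date, description)."""
    related = "\n".join(f"- {num} - {date} - {desc}" for num, date, desc in incidents)
    description = (
        f"PROBLEM STATEMENT:\n{statement}\n\n"
        f"RELATED INCIDENTS:\n{related}\n\n"
        f"INITIAL HYPOTHESIS:\n{hypothesis}"
    )
    return {
        "short_description": f"{ci_name} - {brief}",
        "description": description,
        # Sent as strings, matching the JSON example above.
        "priority": str(priority),
        "impact": str(impact),
        "urgency": str(urgency),
        "category": category,
        "cmdb_ci": ci_sys_id,
        "assignment_group": group_sys_id,
    }

# Illustrative values only; the sys_ids are placeholders.
payload = build_problem_payload(
    ci_name="Email Server",
    brief="Intermittent connection failures",
    statement="Users experiencing intermittent email connectivity.",
    incidents=[("INC001234", "2024-01-02", "Email timeout")],
    hypothesis="Connection pool exhaustion.",
    ci_sys_id="ci_sys_id",
    group_sys_id="group_sys_id",
)
body = json.dumps(payload)  # string to send as the POST body
```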
Create investigation tasks for different areas:
Tool: SN-Create-Record
Parameters:
  table_name: problem_task
  data:
    parent: [problem_sys_id]
    short_description: "RCA - [Area] Investigation"
    description: |
      Investigate [specific area] as potential root cause.
      INVESTIGATION SCOPE:
      - [Item 1 to investigate]
      - [Item 2 to investigate]
      - [Item 3 to investigate]
      EXPECTED DELIVERABLES:
      - Findings documented in work notes
      - Evidence collected (logs, screenshots)
      - Recommendation for next steps
    assignment_group: [specialist_team]
    priority: 2
Document in problem work notes:
Tool: SN-Add-Work-Notes
Parameters:
  sys_id: [problem_sys_id]
  work_notes: |
    === 5-WHYS ROOT CAUSE ANALYSIS ===
    Analyst: [name]
    Date: [date]
    PROBLEM STATEMENT:
    Email service experiencing intermittent connection failures
    WHY 1: Why are connection failures occurring?
    Answer: The email server is running out of available connections.
    WHY 2: Why is the server running out of connections?
    Answer: Connection pool is exhausted due to connections not being released.
    WHY 3: Why are connections not being released?
    Answer: A memory leak in the email client integration is holding connections.
    WHY 4: Why is there a memory leak?
    Answer: The integration code doesn't properly handle error conditions.
    WHY 5: Why doesn't the code handle error conditions?
    Answer: Code review process didn't catch the missing error handling.
    ROOT CAUSE:
    Missing error handling in email client integration code, combined with
    insufficient code review process for integration components.
    CONTRIBUTING FACTORS:
    - No connection timeout configured
    - Monitoring didn't alert on connection pool
    - Documentation gap on error handling standards
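If the question and answer chain is captured as data during the analysis session, the work note above can be generated instead of typed. A small Python sketch under that assumption; `format_five_whys` is a hypothetical helper that follows the note layout shown above:

```python
def format_five_whys(problem, chain, root_cause):
    """Render a 5-Whys work note; `chain` is an ordered list of (question, answer) pairs."""
    lines = [
        "=== 5-WHYS ROOT CAUSE ANALYSIS ===",
        "PROBLEM STATEMENT:",
        problem,
    ]
    for n, (question, answer) in enumerate(chain, start=1):
        lines.append(f"WHY {n}: {question}")
        lines.append(f"Answer: {answer}")
    lines += ["ROOT CAUSE:", root_cause]
    return "\n".join(lines)

# First two steps of the example above, used as sample input.
note = format_five_whys(
    "Email service experiencing intermittent connection failures",
    [
        ("Why are connection failures occurring?",
         "The email server is running out of available connections."),
        ("Why is the server running out of connections?",
         "Connection pool is exhausted due to connections not being released."),
    ],
    "Missing error handling in email client integration code.",
)
```

The chain is not forced to exactly five entries, since an analysis may reach the root cause in fewer or more steps.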
Tool: SN-Add-Work-Notes
Parameters:
  sys_id: [problem_sys_id]
  work_notes: |
    === FISHBONE ANALYSIS ===
    Problem: [Problem statement]
    PEOPLE:
    - [Factor 1]
    - [Factor 2]
    PROCESS:
    - [Factor 1]
    - [Factor 2]
    TECHNOLOGY:
    - [Factor 1]
    - [Factor 2]
    ENVIRONMENT:
    - [Factor 1]
    - [Factor 2]
    DATA:
    - [Factor 1]
    - [Factor 2]
    EXTERNAL:
    - [Factor 1]
    - [Factor 2]
    PRIMARY ROOT CAUSES:
    1. [Root cause from analysis]
    2. [Contributing root cause]
Using MCP:
Tool: SN-Update-Record
Parameters:
  table_name: problem
  sys_id: [problem_sys_id]
  data:
    state: 103  # Root Cause Analysis
    root_cause: |
      ROOT CAUSE IDENTIFIED:
      Primary Root Cause:
      [Detailed description of the root cause]
      Contributing Factors:
      1. [Factor 1]
      2. [Factor 2]
      3. [Factor 3]
      Evidence:
      - [Log entry/data supporting conclusion]
      - [Test result]
      - [Other evidence]
      Analysis Method: [5-Whys/Fishbone/Fault Tree/Other]
      Analysis Date: [date]
      Analyst: [name]
Once root cause is confirmed, document as known error:
Using MCP:
Tool: SN-Create-Record
Parameters:
  table_name: known_error
  data:
    short_description: "[CI] - [Error description]"
    description: |
      KNOWN ERROR DESCRIPTION:
      [Clear description of the error condition]
      SYMPTOMS:
      - [Symptom 1]
      - [Symptom 2]
      - [Symptom 3]
      ROOT CAUSE:
      [Root cause description]
      AFFECTED SERVICES/CIS:
      - [Service/CI 1]
      - [Service/CI 2]
    workaround: |
      WORKAROUND INSTRUCTIONS:
      When to use: [Condition when workaround applies]
      Steps:
      1. [Step 1]
      2. [Step 2]
      3. [Step 3]
      Expected Result: [What user should see]
      Limitations:
      - [Limitation 1]
      - [Limitation 2]
      Contact [team] if workaround does not resolve the issue.
    problem: [problem_sys_id]
    cmdb_ci: [ci_sys_id]
    u_permanent_fix_planned: true
    u_fix_date: [target date]
Using REST API:
POST /api/now/table/known_error
Content-Type: application/json
{
  "short_description": "Email Server - Connection timeout during peak hours",
  "description": "KNOWN ERROR DESCRIPTION:\nEmail connections may timeout...",
  "workaround": "WORKAROUND INSTRUCTIONS:\n1. Close and reopen email client...",
  "problem": "problem_sys_id",
  "cmdb_ci": "ci_sys_id"
}
Update related incidents:
Tool: SN-Update-Record
Parameters:
  table_name: incident
  sys_id: [incident_sys_id]
  data:
    problem_id: [problem_sys_id]
    work_notes: "Linked to Known Error [KERR#] - Workaround available. See KB article [KB#]."
Batch update multiple incidents:
Tool: SN-Query-Table
Parameters:
  table_name: incident
  query: cmdb_ci=[ci_sys_id]^problem_idISEMPTY^stateIN1,2,3
  fields: sys_id,number
  limit: 50
Then for each:
Tool: SN-Update-Record
Parameters:
  table_name: incident
  sys_id: [each_incident_sys_id]
  data:
    problem_id: [problem_sys_id]
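The query-then-update loop above maps to one Table API PATCH per incident. The Python sketch below only prepares the requests so they can be reviewed before anything is sent; `build_link_requests` is a hypothetical helper, and the sys_ids are placeholders:

```python
def build_link_requests(incident_sys_ids, problem_sys_id):
    """Return one PATCH request description per incident, linking it to the problem."""
    return [
        {
            "method": "PATCH",
            "path": f"/api/now/table/incident/{sys_id}",
            "body": {"problem_id": problem_sys_id},
        }
        for sys_id in incident_sys_ids
    ]

# sys_ids as returned by the preceding incident query (placeholders here).
requests_to_send = build_link_requests(["inc_sys_id_1", "inc_sys_id_2"], "problem_sys_id")
```

Keeping request construction separate from sending makes a dry run possible, which is useful when the query could match up to 50 incidents.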
Detailed workaround documentation:
Tool: SN-Add-Work-Notes
Parameters:
  sys_id: [problem_sys_id]
  work_notes: |
    === WORKAROUND DOCUMENTED ===
    WORKAROUND ID: WA-[number]
    Effective Date: [date]
    Author: [name]
    APPLICABILITY:
    - Applies to: [specific conditions]
    - Does NOT apply to: [exclusions]
    PREREQUISITES:
    - [Prerequisite 1]
    - [Prerequisite 2]
    PROCEDURE:
    1. [Detailed step 1]
       Note: [Important note if applicable]
    2. [Detailed step 2]
       Expected Result: [What to expect]
    3. [Detailed step 3]
    VERIFICATION:
    - [How to verify workaround worked]
    KNOWN LIMITATIONS:
    - [Limitation 1]
    - [Limitation 2]
    ROLLBACK PROCEDURE:
    If workaround causes issues:
    1. [Rollback step 1]
    2. [Rollback step 2]
    SUPPORT CONTACT:
    If workaround fails, contact [team/person] at [contact info]
Tool: SN-Add-Problem-Comment
Parameters:
  sys_id: [problem_sys_id]
  comment: |
    === WORKAROUND AVAILABLE FOR SERVICE DESK ===
    Problem: [PRB#]
    Known Error: [KERR#]
    QUICK REFERENCE FOR AGENTS:
    Customer Reports: "[Common customer description]"
    Solution: [Brief description of workaround]
    Steps for Customer:
    1. [Simple step 1]
    2. [Simple step 2]
    3. [Simple step 3]
    Escalate if: [Condition for escalation]
    Related KB: [KB article link]
Create change request for fix:
Tool: SN-Create-Record
Parameters:
  table_name: change_request
  data:
    short_description: "Fix: [Problem description]"
    description: |
      CHANGE PURPOSE:
      Implement permanent fix for problem [PRB#]
      ROOT CAUSE ADDRESSED:
      [Root cause from problem record]
      PROPOSED SOLUTION:
      [Technical description of fix]
      EXPECTED OUTCOME:
      - [Outcome 1]
      - [Outcome 2]
      TESTING PLAN:
      - [Test 1]
      - [Test 2]
      ROLLBACK PLAN:
      - [Rollback step 1]
      - [Rollback step 2]
    type: normal
    priority: 2
    assignment_group: [development_team]
    u_related_problem: [problem_sys_id]
Tool: SN-Update-Record
Parameters:
  table_name: problem
  sys_id: [problem_sys_id]
  data:
    state: 104  # Fix in Progress
    fix: |
      PERMANENT FIX PLAN:
      Solution: [Description of permanent fix]
      Implementation Method:
      - Change Request: [CHG#]
      - Target Date: [date]
      - Implementation Team: [team]
      Technical Details:
      [Detailed technical fix description]
      Validation Criteria:
      - [ ] [Criterion 1]
      - [ ] [Criterion 2]
      - [ ] [Criterion 3]
Post-fix monitoring:
Tool: SN-Query-Table
Parameters:
  table_name: incident
  query: cmdb_ci=[ci_sys_id]^sys_created_on>=javascript:gs.daysAgoStart(14)^problem_id=[problem_sys_id]
  fields: sys_id,number,short_description,sys_created_on
  limit: 50
Document verification:
Tool: SN-Add-Work-Notes
Parameters:
  sys_id: [problem_sys_id]
  work_notes: |
    === FIX VERIFICATION ===
    Verification Period: [date range]
    METRICS:
    - Incidents before fix: [count] per [period]
    - Incidents after fix: [count] per [period]
    - Reduction: [percentage]
    MONITORING DATA:
    - [Metric 1]: [before] → [after]
    - [Metric 2]: [before] → [after]
    USER FEEDBACK:
    - [Feedback item 1]
    - [Feedback item 2]
    VERIFICATION STATUS: [Pass/Fail/Partial]
    RECOMMENDATION: [Close problem/Continue monitoring/Additional action]
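The Reduction and VERIFICATION STATUS entries above follow from the before and after counts, provided both are measured over periods of equal length. A minimal Python sketch; the 90% pass threshold is an assumption chosen for illustration, not a value defined by this skill or by ServiceNow:

```python
def incident_reduction(before, after):
    """Percentage drop in incident count between two equal-length periods."""
    if before <= 0:
        raise ValueError("before must be a positive incident count")
    return round((before - after) / before * 100, 1)

def verification_status(before, after, pass_threshold=90.0):
    """Map the reduction to Pass/Partial/Fail; the threshold is an assumed default."""
    reduction = incident_reduction(before, after)
    if reduction >= pass_threshold:
        return "Pass"
    if reduction > 0:
        return "Partial"
    return "Fail"

# Example: 12 incidents in the period before the fix, 3 in the period after.
reduction = incident_reduction(12, 3)   # 75.0
status = verification_status(12, 3)     # "Partial"
```

A "Partial" result corresponds to the "Continue monitoring" recommendation in the note, while "Fail" suggests the root cause was not fully addressed.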
Using MCP:
Tool: SN-Close-Problem
Parameters:
  sys_id: [problem_sys_id]
  close_code: Fix Applied
  close_notes: |
    PROBLEM CLOSURE SUMMARY:
    Root Cause: [Summary]
    Resolution: [What was done]
    Implementation:
    - Change: [CHG#]
    - Date: [implementation date]
    Effectiveness:
    - Incident reduction: [percentage]
    - Monitoring period: [dates]
    - No recurrence confirmed
    Documentation:
    - Known Error: [KERR#]
    - Knowledge Article: [KB#]
    Lessons Learned:
    - [Lesson 1]
    - [Lesson 2]
Using REST API:
PATCH /api/now/table/problem/{sys_id}
Content-Type: application/json
{
  "state": "107",
  "close_code": "Fix Applied",
  "close_notes": "PROBLEM CLOSURE SUMMARY:\n\nRoot Cause: Memory leak in email integration...",
  "resolved_at": "2024-01-15 14:30:00",
  "resolved_by": "admin"
}
┌────────────┐     ┌────────────────┐     ┌─────────────┐
│    New     │────►│   Root Cause   │────►│   Fix in    │
│   (101)    │     │ Analysis (103) │     │  Progress   │
└────────────┘     └────────────────┘     │    (104)    │
                           │              └──────┬──────┘
                           ▼                     │
                   ┌───────────────┐             │
                   │  Known Error  │             │
                   │     (102)     │◄────────────┘
                   └───────┬───────┘
                           │
                           ▼
           ┌───────────────────────────────┐
           │           Resolved            │
           │             (106)             │
           └───────────────┬───────────────┘
                           │
                           ▼
           ┌───────────────────────────────┐
           │            Closed             │
           │             (107)             │
           └───────────────────────────────┘
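The arrows in the diagram can be encoded as a lookup table so that a state update is checked before it is sent. This Python sketch reflects only the transitions drawn above; an instance's configured state model may allow others, and `can_transition` is a hypothetical helper:

```python
# State codes and names as labeled in the diagram above.
STATE_NAMES = {
    101: "New",
    102: "Known Error",
    103: "Root Cause Analysis",
    104: "Fix in Progress",
    106: "Resolved",
    107: "Closed",
}

# Allowed next states, one entry per arrow in the diagram.
ALLOWED_TRANSITIONS = {
    101: {103},
    103: {102, 104},
    104: {102},
    102: {106},
    106: {107},
    107: set(),
}

def can_transition(current, target):
    """Return True if the diagram has an arrow from `current` to `target`."""
    return target in ALLOWED_TRANSITIONS.get(current, set())

# Example checks against the diagram.
new_to_rca = can_transition(101, 103)      # True
new_to_closed = can_transition(101, 107)   # False
```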
| Tool | Purpose | Phase |
|------|---------|-------|
| SN-List-Problems | List existing problems | 1 |
| SN-Query-Table | Find incident patterns, verify fix | 1, 6 |
| SN-Create-Record | Create problem, known error, tasks | 1, 2, 3 |
| SN-Update-Record | Update problem status, root cause, fix | 2, 5 |
| SN-Add-Work-Notes | Document analysis and findings | All |
| SN-Add-Problem-Comment | Customer/Service Desk communication | 4 |
| SN-Close-Problem | Close resolved problem | 6 |
| Endpoint | Method | Purpose |
|----------|--------|---------|
| /api/now/table/problem | GET | List problems |
| /api/now/table/problem | POST | Create problem |
| /api/now/table/problem/{sys_id} | PATCH | Update problem |
| /api/now/table/problem_task | POST | Create investigation task |
| /api/now/table/known_error | POST | Create known error |
| /api/now/table/incident | GET | Query related incidents |
Cause: Query criteria too restrictive or wrong CI
Solution: Broaden date range; verify CI sys_id; try keyword search
Cause: Insufficient data or multiple contributing factors
Solution: Gather more data; involve additional SMEs; consider environmental factors
Cause: Workaround doesn't address all scenarios
Solution: Refine workaround; document limitations; create separate workaround for other scenarios
Cause: Root cause not fully addressed; new variation of same issue
Solution: Review if truly same problem; may need new problem record for variation
Problem: [Statement]
Why 1: [Question]
Because: [Answer]
Why 2: [Question based on Why 1 answer]
Because: [Answer]
Why 3: [Question based on Why 2 answer]
Because: [Answer]
Why 4: [Question based on Why 3 answer]
Because: [Answer]
Why 5: [Question based on Why 4 answer]
Because: [Answer - typically root cause]
Root Cause: [Summary]
Problem: _________________________________
Categories:
PEOPLE          PROCESS         TECHNOLOGY      ENVIRONMENT
|               |               |               |
+-- [cause]     +-- [cause]     +-- [cause]     +-- [cause]
+-- [cause]     +-- [cause]     +-- [cause]     +-- [cause]
Root Causes Identified:
1. [Primary root cause]
2. [Secondary root cause]
itsm/incident-lifecycle - Incident management
itsm/incident-triage - Incident triage
itsm/major-incident - Major incident handling
itsm/change-management - Change for permanent fixes
admin/knowledge-management - Converting to KB articles