Fix Sentry Issues

Systematically discover, triage, investigate, and fix production issues using Sentry MCP. One PR per issue, root-cause analysis required.

Critical Rule: Truth-Seek, Don't Suppress

NEVER treat log level changes as fixes. Changing logger.error to logger.warn or logger.info silences Sentry but doesn't fix the user's experience.

For every failing code path, ask "Why does this fail?" — not "How do I make Sentry quiet?"

Anti-patterns to avoid

These are specific failure modes from real experience. Do NOT do these:

Batch-classifying issues as "expected" without investigating each one. Reading an error message and seeing a fallback path does NOT mean you understand the failure. You must trace the full input path to understand what's being sent and why it fails.
Treating "has a fallback" as "not a problem." A fallback means the user gets degraded results. Ask: why does the primary path fail? Can we prevent the failure upstream? Is the input wrong? Is the timeout too tight? Is there a missing filter?
Combining multiple issues into one "noise reduction" PR. Each issue has its own root cause. Investigate and fix them individually. The only exception is issues that share an identical root cause discovered through investigation.
Throwing away error details. Never change catch (error) { logger.error(..., error) } to catch { logger.info(...) }. The structured error data (status codes, messages, stack traces) is exactly what you need to understand the failure.
Deciding the fix during triage. The triage table should classify issues as "Investigate" or "Ignore" — never pre-decide that the fix is a log level change. You don't know the fix until you've completed investigation.
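A minimal sketch of the "keep the error details" rule, assuming the logger accepts a structured payload as its second argument. `describeError`, `ErrorDetails`, and the `status` property are hypothetical names used for illustration, not part of any specific logging library.

```typescript
// Hypothetical helper: turn whatever was caught into structured fields that
// can be attached to a log call instead of being discarded.
interface ErrorDetails {
  name: string;
  message: string;
  status?: number;
  stack?: string;
}

function describeError(error: unknown): ErrorDetails {
  if (error instanceof Error) {
    // Assumption: some HTTP client errors carry a numeric `status` property.
    const status = (error as unknown as Record<string, unknown>)["status"];
    return {
      name: error.name,
      message: error.message,
      status: typeof status === "number" ? status : undefined,
      stack: error.stack,
    };
  }
  // Non-Error throwables (strings, plain objects) are still recorded.
  return { name: "NonErrorThrown", message: String(error) };
}
```

Inside a handler this would be used as `catch (error) { logger.warn("scrape failed", describeError(error)) }`, so the status code and stack trace stay available even if the level is later adjusted.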

When a log level change IS valid

A downgrade to logger.info is valid ONLY for genuinely expected operational states — NOT for failures with fallbacks. Examples:

Valid: User's Notion database doesn't have an optional "Author" column → property skipped. This is user configuration, not a failure.
Valid: Supabase returns 404 for a link the user deleted. The resource genuinely doesn't exist.
Invalid: Firecrawl scrape fails 300 times/day → downgrade to info. WHY is it failing? Are we sending URLs it can't handle? Are we hitting rate limits?
Invalid: Summary generation times out → downgrade to info. WHY is the API slow? Is the content too large? Is there a network issue?

Phase 1: Discover

Load the Sentry MCP tools with ToolSearch first, then use them to find the org, the project, and all unresolved issues.

mcp__sentry__find_organizations()
mcp__sentry__find_projects(organizationSlug, regionUrl)
mcp__sentry__search_issues(
  organizationSlug, projectSlugOrId, regionUrl,
  naturalLanguageQuery: "all unresolved issues sorted by events",
  limit: 25
)

Build a triage table. The Action column should be Investigate or Ignore — never a pre-decided fix:

| ID | Title | Events | Action | Reason |
|----|-------|--------|--------|--------|
| PROJ-A | Error in save | 14 | Investigate | User-facing save failure |
| PROJ-B | GM_register... | 3 | Ignore | Greasemonkey extension |

Phase 2: Triage

Classify every issue before writing any code. Only two categories at this stage:

Investigate (our code, worth understanding)

Multiple events establishing a pattern
User sees degraded experience (error status, missing data, broken UI)
High-volume warnings that might indicate an upstream problem
Recurring on every run/sync (stale references, cron-triggered)

Ignore (third-party noise)

Browser extension code (GM_registerMenuCommand, CONFIG, currentInset, MetaMask JSON-RPC)
Stale module imports after deploy (ChunkLoadError — self-resolving)
Single-event transients with no reproduction path
Issues already fixed by a recent commit (mark these as resolved rather than ignored)

Apply triage decisions:

mcp__sentry__update_issue(issueId, organizationSlug, regionUrl, status: "ignored")  // noise
mcp__sentry__update_issue(issueId, organizationSlug, regionUrl, status: "resolved") // already fixed

Phase 3: Investigate (one issue at a time)

For each "Investigate" issue, work through these steps in order. Do NOT skip steps or batch multiple issues together.

3a. Pull event-level data

Issue summaries hide the details you need. Always pull actual events AND the full issue details:

mcp__sentry__get_issue_details(issueId, organizationSlug, regionUrl)
mcp__sentry__search_issue_events(
  issueId, organizationSlug, regionUrl,
  naturalLanguageQuery: "all events with extra data",
  limit: 15
)

Extract from the events: actual URLs, request parameters, stack traces, timestamps, user context, extra data fields (status codes, content lengths, etc.). These are the real inputs that triggered the failure.
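Once the events are pulled, a quick tally of the extra data often shows whether one failure mode dominates. The sketch below assumes you have copied the relevant fields into a simple array; the `EventSummary` shape and `countByStatus` are hypothetical and do not reflect the Sentry MCP response format.

```typescript
// Hypothetical, hand-assembled summary of fields copied out of Sentry events.
interface EventSummary {
  id: string;
  url?: string;
  statusCode?: number;
}

// Count events per status code; events without one are grouped as "unknown".
function countByStatus(events: EventSummary[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const event of events) {
    const key = event.statusCode === undefined ? "unknown" : String(event.statusCode);
    counts.set(key, (counts.get(key) ?? 0) + 1);
  }
  return counts;
}
```

A distribution dominated by a single status code points at one root cause; a wide spread suggests the issue groups several distinct failures.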

3b. Cross-reference with Axiom logs

Axiom events include traceId fields that correlate with Sentry errors. Use the Axiom CLI to pull surrounding logs for richer context:

# Get the traceId from the Sentry event's trace context
# Then query Axiom for all events with that traceId
axiom query "['shiori-events'] | where traceId == '<traceId>'" -f json

# Or search by userId around the error timestamp for broader context
axiom query "['shiori-events'] | where userId == '<userId>' | where _time > datetime('2025-01-01T00:00:00Z') and _time < datetime('2025-01-01T01:00:00Z')" -f json

Axiom logs include fields like authMethod, client_version, event type, and request metadata that Sentry often lacks. This helps you understand what the user was doing before and after the error.
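The time-window query above can be generated from the error timestamp rather than typed by hand. This is a sketch only: `buildAxiomWindowQuery` and the 30-minute default window are assumptions, and the dataset name is copied from the queries above.

```typescript
// Hypothetical helper: build the userId query for a window around the error.
// The returned string is meant to be passed to `axiom query "<query>" -f json`.
function buildAxiomWindowQuery(
  userId: string,
  errorTime: Date,
  windowMinutes: number = 30, // assumption: pick whatever window fits the incident
): string {
  if (userId.includes("'")) {
    // The value is interpolated into a quoted string, so reject quotes outright.
    throw new Error("userId must not contain single quotes");
  }
  const windowMs = windowMinutes * 60 * 1000;
  const from = new Date(errorTime.getTime() - windowMs).toISOString();
  const to = new Date(errorTime.getTime() + windowMs).toISOString();
  return (
    `['shiori-events'] | where userId == '${userId}' ` +
    `| where _time > datetime('${from}') and _time < datetime('${to}')`
  );
}
```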

3c. Read the failing code path

Follow the stack trace. Read every file in the chain. Understand what the code does before proposing changes. Use subagents for parallel file exploration if the stack is deep.

3d. Trace the input path upstream

This is the step most often skipped, and the most important:

What data reaches the failing function? Trace backwards from the error to the original input. What URL/payload/parameters were passed?
Should this input have reached this code path at all? Is there a missing filter, validation, or early return upstream?
What does the input look like? For URL-based failures: is it a binary file? A redirect? A localhost URL? Something the API can't handle?
Is the failure in our code or an external service? If external: can we prevent sending bad inputs? Can we add better pre-filtering?
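When the answers to these questions point at inputs that should never have reached an external API, the fix is usually a small upstream check. The sketch below is one possible pre-filter; `isLikelyUnscrapable` and the extension list are illustrative assumptions, and the real list should come from the URLs seen in the Sentry events.

```typescript
// Assumption: illustrative, non-exhaustive list of extensions to skip.
const BINARY_EXTENSIONS = new Set([".png", ".jpg", ".jpeg", ".gif", ".zip", ".mp4"]);

// Returns true when a URL should be skipped before calling the scraping API.
function isLikelyUnscrapable(rawUrl: string): boolean {
  let url: URL;
  try {
    url = new URL(rawUrl);
  } catch {
    return true; // not a parseable absolute URL
  }
  if (url.protocol !== "http:" && url.protocol !== "https:") return true;
  if (url.hostname === "localhost" || url.hostname === "127.0.0.1") return true;
  // Compare only the path's extension, so query strings do not interfere.
  const path = url.pathname.toLowerCase();
  const dot = path.lastIndexOf(".");
  return dot !== -1 && BINARY_EXTENSIONS.has(path.slice(dot));
}
```

Checking the extension is a cheap first pass. It will not catch binary content served from extensionless URLs, which would need a content-type check instead.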

3e. Reproduce and verify

Use the actual failing inputs from Sentry events:

Call the function with the exact data that failed
fetch() the actual URLs that timed out — are they reachable?
Add temporary console.log statements to verify your understanding of the code flow
Check if the failure is in our code or an external service

3f. Identify root cause

Ask these questions in order:

Why does this specific input fail? (e.g., "Firecrawl can't scrape a .png URL")
Why does this input reach this code path? (e.g., "No extension check before calling Firecrawl")
What's the right fix? (e.g., "Filter binary URLs before calling Firecrawl" — not "suppress the log")
Should we also improve observability? (e.g., "Add status code to the log so we can see the failure distribution")

Common root causes:

| Pattern | Root Cause | Real Fix |
|---------|-----------|----------|
| External API fails on certain URLs | Wrong inputs being sent (binary files, bad formats) | Filter/validate inputs before sending |
| External API timeout | Timeout too tight, or input too large, or missing retry | Investigate what's slow, adjust timeout or input size |
| DB rejects "invalid json" | Unsanitized input (null bytes, control chars) | Sanitize before insert |
| Processing stuck in "error" | Timeout budget doesn't account for full pipeline | Adjust timeouts, save partial results on timeout |
| Same error on every cron run | Stale reference to deleted external resource | Detect staleness, auto-clean |
| Error logged but details not useful | Error object not included, or status code missing | Improve the log to include actionable details |
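As an illustration of the "sanitize before insert" fix for unsanitized input, the sketch below strips NUL bytes and other control characters from a string. `stripControlChars` is a hypothetical helper, and which characters your database actually rejects is an assumption to confirm against the failing payloads in the Sentry events.

```typescript
// Hypothetical helper: remove NUL and other C0 control characters while
// keeping tab (\u0009), newline (\u000A), and carriage return (\u000D).
function stripControlChars(input: string): string {
  return input.replace(/[\u0000-\u0008\u000B\u000C\u000E-\u001F]/g, "");
}
```

Applying this at the point where scraped or user-supplied text enters the pipeline keeps the insert path itself unchanged.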

3g. Know your log levels

Log levels control what reaches Sentry:

| Level | Sends to Sentry? | Use for |
|-------|-------------------|---------|
| logger.error | Yes (error) | Unexpected bugs, states that should never occur |
| logger.warn | Yes (warning) | Handled failures worth monitoring; keep until you understand the pattern |
| logger.info | No | Genuinely expected operational states (not "failures with fallbacks") |

Phase 4: Fix

4a. Branch from main

git checkout main && git pull
git checkout -b fix/<descriptive-name>

One branch per issue. Keep fixes focused.

4b. Write tests first

Tests must use data derived from actual Sentry events, not hypothetical inputs. The test should fail before the fix and pass after.
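One way to structure such tests is to keep the failing inputs as fixtures keyed by Sentry event ID and run the code under test against every one of them. The sketch below is framework-agnostic; `findRegressions` and the fixture shape are hypothetical, and the fixture values must be copied from real events rather than invented.

```typescript
// Hypothetical helper: run a check against each fixture and return the event
// IDs of those that are still handled incorrectly.
function findRegressions<T extends { eventId: string }>(
  fixtures: T[],
  handlesCorrectly: (fixture: T) => boolean,
): string[] {
  return fixtures
    .filter((fixture) => !handlesCorrectly(fixture))
    .map((fixture) => fixture.eventId);
}
```

Before the fix, the returned list should contain the event IDs from the issue; after the fix, it should be empty. Asserting on the list inside your test runner gives the fail-then-pass behavior described above.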

4c. Implement the fix

Fix the root cause, not the symptom.

Self-check before committing: If the fix is primarily a log level change, STOP. Ask yourself:

Did I investigate why this fails, or did I just see a fallback and suppress?
Can I prevent the failure upstream instead of silencing it?
Am I throwing away error details that would help debug future occurrences?
Would a staff engineer look at this PR and say "but why does it fail in the first place?"

4d. Verify

Run tests (e.g., bun run test)
Run lint
Confirm the fix handles the actual failing inputs from Sentry events
Remove any temporary console.log statements

4e. Create PR

git push -u origin fix/<descriptive-name>
gh pr create --title "<short title>" --body "$(cat <<'EOF'
## Summary
- **Root cause**: [What was actually wrong — the upstream reason, not just "it throws an error"]
- **Fix**: [What changed and why this prevents the failure, not just silences it]

## Test plan
- [x] Tests written using data from Sentry events
- [x] All tests pass
- [x] Lint passes
EOF
)"

4f. Resolve in Sentry

After the PR is merged:

git checkout main && git pull

mcp__sentry__update_issue(issueId, organizationSlug, regionUrl, status: "resolved")

Phase 5: Repeat

Work through issues by priority (most events first). After each PR:

Return to main, pull latest
Pick next issue from the triage table
Start Phase 3 again — full investigation for each issue

Checklist Per Issue

[ ] Pulled event-level data (not just issue summary)
[ ] Cross-referenced with Axiom logs using traceId for surrounding context
[ ] Read the failing code path end-to-end
[ ] Traced the input path upstream — understood what data triggers the failure
[ ] Identified root cause (not just "it has a fallback")
[ ] Fix prevents the failure, not just suppresses the log
[ ] Tests use real-world data from Sentry events
[ ] Tests pass, lint passes
[ ] No error details thrown away (catch variables, status codes, etc.)
[ ] PR created with upstream root cause explanation
[ ] Sentry issue resolved after merge
