plugins/aem/6.5-lts/skills/aem-workflow/workflow-debugging/SKILL.md
Debug AEM Workflow issues on AEM 6.5 LTS and AMS including stuck workflows, failed steps, missing Inbox tasks, launcher failures, stale instances, thread pool exhaustion, queue backlogs, purge failures, and permissions errors. Use when the user reports workflow problems on AEM 6.5 LTS or AMS, asks why a workflow is stuck or failed, needs step-by-step troubleshooting, or provides thread dumps, configuration status dumps, or Sling Job console output for analysis.
Production-grade debugging for AEM Granite Workflow engine, launcher, Inbox, Sling Jobs, thread pools, and purge on AEM 6.5 LTS and Adobe Managed Services (AMS).
| Symptom | symptom_id | First action |
|---------|------------|--------------|
| Workflow stuck (not advancing) | workflow_stuck_not_progressing | Open instance; note current step type. No work item → stale. |
| Task not in Inbox | task_not_in_inbox | Confirm Participant step; assignee = logged-in user; Inbox filters. |
| Workflow not starting (launcher) | workflow_not_starting_launcher | Launcher enabled; path/event match payload. |
| Workflow fails or shows error | workflow_fails_or_shows_error | Instance history; error.log for instance ID; payload and process. |
| Step failed, retries exhausted | step_failed_retries_exhausted | Logs → process.label → JMX retryFailedWorkItems or Inbox retry. |
| Stale (no current work item) | stale_workflow_no_work_item | JMX countStaleWorkflows → restartStaleWorkflows(dryRun=true). |
| Repository bloat / too many instances | repository_bloat_too_many_instances | JMX purgeCompleted(dryRun=true) or Purge Scheduler. |
| User cannot see or complete item | user_cannot_see_or_complete_item | Assignee/initiator/superuser; enforce flags. |
| Cannot delete model | cannot_delete_model | JMX countRunningWorkflows → terminate → delete. |
| Slow throughput / queue backlog | slow_throughput_queue_backlog | JMX returnSystemJobInfo; queue.maxparallel on Granite Workflow Queue; Sling thread pool. |
| Auto-advancement not working | workflow_auto_advance_failure | Check default thread pool saturation; Sling Scheduler; timeout jobs. |
| New workflow not working | workflow_setup_validation | Model sync, launcher, process registration, permissions. |
- Stale workflow: countStaleWorkflows → restartStaleWorkflows(dryRun=true).
- Failed process step: process.label registered, payload path exists, bundle active, no exception in execute().
- Thread dumps on 6.5 / AMS are obtained via jstack or by requesting them from AMS support. Configuration status ZIPs come from Felix Console → Status → Configuration Status.
default thread pool (critical path)

The Sling Scheduler ApacheSlingdefault uses ThreadPool: default. This pool runs:
- Dispatch of com/adobe/granite/workflow/timeout/job events to the Sling Job system (the job itself then runs on the Granite Workflow Queue, see Step 3c)

Check the Sling Thread Pools status page (/system/console/status-slingthreadpools):
| Field | Healthy | Problem |
|-------|---------|---------|
| active count | < max pool size | = max pool size (saturated) |
| block policy | RUN | ABORT (rejects tasks when full) |
| max pool size | sized for workload | OOTB on AEM 6.5 LTS is 5/5 (Apache Sling default). Bump to 20+ in OSGi config for environments with many custom periodic schedulers, otherwise schedulers can starve. |
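If active count = max pool size AND block policy = ABORT, the pool rejects new tasks (RejectedExecutionException), so workflow timeout jobs can be silently dropped.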
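The health table above can be encoded as a small triage helper. This is a minimal sketch, assuming you have already read the three values off the Sling Thread Pools status page by hand; the `PoolStatus` type and `assess_pool` function are hypothetical names for illustration, not part of any AEM or Sling API.

```python
from dataclasses import dataclass


@dataclass
class PoolStatus:
    """Values copied manually from /system/console/status-slingthreadpools."""
    name: str
    active_count: int
    max_pool_size: int
    block_policy: str  # "RUN" or "ABORT"


def assess_pool(pool: PoolStatus) -> str:
    """Classify a pool using the Healthy/Problem columns of the table."""
    saturated = pool.active_count >= pool.max_pool_size
    aborting = pool.block_policy.upper() == "ABORT"
    if saturated and aborting:
        return "critical: saturated with ABORT, new tasks are rejected"
    if saturated:
        return "warning: saturated, tasks back up on the caller thread"
    if aborting:
        return "warning: ABORT policy rejects tasks once the pool fills"
    return "healthy"
```

For example, `assess_pool(PoolStatus("default", 5, 5, "ABORT"))` falls into the critical branch, which is the combination this section treats as the critical path.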
Check the Threads status page (/system/console/status-Threads) or the jstack thread dump (/system/console/status-jstack-threaddump):
- sling-default-* threads: are they all stuck on the same stack?
- elapsed time: threads stuck for hours indicate a hung external call without timeout

Check Apache Sling Job Thread Pool in the Sling Thread Pools status page:
Check the Sling Jobs page (/system/console/slingevent):
| Field | Healthy | Problem |
|-------|---------|---------|
| Queued Jobs (overall) | 0 | > 0 (jobs waiting) |
| Failed Jobs | 0 | > 0 (step failures) |
| Active Jobs | 0-N | 0 when Queued > 0 (jobs not picked up) |
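The same table can be read as a short decision rule. The sketch below assumes the three counters have been copied from the Sling Jobs page; `assess_job_queue` is a hypothetical helper name, and the finding strings simply reuse the wording of the Problem column.

```python
def assess_job_queue(queued: int, failed: int, active: int) -> list:
    """Return the problems implied by the Sling Jobs counters."""
    findings = []
    if queued > 0 and active == 0:
        # Jobs are waiting but nothing is processing them.
        findings.append("jobs not picked up")
    elif queued > 0:
        findings.append("jobs waiting")
    if failed > 0:
        findings.append("step failures")
    return findings or ["healthy"]
```

A result of `["jobs not picked up"]` points at the queue or thread pool rather than at a process step, which is why it is checked before the generic "jobs waiting" case.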
Check topic statistics for workflow model:
- Topic: com/adobe/granite/workflow/job/var/workflow/models/<modelName>
- High Failed Jobs / low Finished Jobs ratio → process step throwing exceptions

Check Granite Workflow Queue configuration:
- Verify the running queue.maxparallel value at /system/console/configMgr/org.apache.sling.event.jobs.QueueConfiguration~workflow before assuming a default.

Check the Sling Scheduler status page (/system/console/status-slingscheduler):
- WorkflowStatsMBean collector (used by the Statistics MBean): its nextFireTime should be in the near future; nextFireTime: null means the trigger was deregistered.
- The com/adobe/granite/workflow/timeout/job topic itself is a Sling Job, not a Quartz job. Check it on the Sling Jobs page (/system/console/slingevent), not here.
- ApacheSlingdefault uses ThreadPool: default. That is how the periodic timeout-detection scheduler reaches the workflow engine.

| Pattern | Cause | Action |
|---------|-------|--------|
| Error executing workflow step | Process step exception | Check stack; fix process code or payload |
| getProcess for '<name>' failed | No WorkflowProcess registered | Deploy bundle; match process.label |
| Cannot archive workitem | Archive failure → stale risk | JMX restartStaleWorkflows |
| refreshing the session since we had to wait for a lock | Lock contention | Tune queue.maxparallel on the Granite Workflow Queue (Apache Sling Job Queue Configuration); reduce concurrent writes to the same path |
| Terminate failed / Resume failed / Suspend failed | Permissions (not initiator/superuser) | Check enforceWorkflowInitiatorPermissions; add to superusers |
| PathNotFoundException (workflow/payload) | Payload/launcher path missing | Verify payload exists; check launcher config path |
| Error adding launcher config | Launcher config path not created | Create /conf/global/settings/workflow/launcher/config |
| retrys exceeded - remove isTransient | Transient workflow failed after retries | Fix process code; instance persisted for admin handling |
| RejectedExecutionException | Thread pool full with ABORT policy | Increase pool size or change policy to RUN; fix stuck threads |
| Workflow is already finished | Terminate on completed/aborted instance | Check logic calling terminate |
| Workflow purge '<name>' : repository exception | Purge JCR error | Check permissions; repo health |
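When error.log is large, the patterns in this table can be counted in one pass. The following is a rough sketch, not an official tool: the pattern strings and causes are taken from the table above (only a subset is shown), while `LOG_PATTERNS` and `classify_log` are hypothetical names, and the surrounding log line format is not assumed at all because the match is a plain substring or regex search.

```python
import re
from collections import Counter

# (pattern, cause) pairs copied from the log pattern table; extend as needed.
LOG_PATTERNS = [
    (re.compile(r"Error executing workflow step"), "Process step exception"),
    (re.compile(r"getProcess for '.*' failed"), "No WorkflowProcess registered"),
    (re.compile(r"Cannot archive workitem"), "Archive failure, stale risk"),
    (re.compile(r"refreshing the session since we had to wait for a lock"),
     "Lock contention"),
    (re.compile(r"RejectedExecutionException"),
     "Thread pool full with ABORT policy"),
    (re.compile(r"PathNotFoundException"), "Payload/launcher path missing"),
]


def classify_log(lines):
    """Count how often each known cause appears in the given log lines."""
    causes = Counter()
    for line in lines:
        for pattern, cause in LOG_PATTERNS:
            if pattern.search(line):
                causes[cause] += 1
    return causes
```

Typical use would be `classify_log(open("error.log", encoding="utf-8", errors="replace"))`, then starting with whichever cause dominates the counts.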
In Felix Console → OSGi → Configuration (/system/console/configMgr):
| Config | Property | Check |
|--------|----------|-------|
| WorkflowSessionFactory | cq.workflow.job.retry | Default 3; increase for flaky steps |
| WorkflowSessionFactory | cq.workflow.superuser | Must include admin users/groups (OOTB list includes admin, administrators, workflow-process-service, workflow-service, workflow-administrators, wcm-workflow-service) |
| WorkflowSessionFactory | granite.workflow.enforceWorkitemAssigneePermissions | true (OOTB) = only assignee sees items |
| WorkflowSessionFactory | granite.workflow.enforceWorkflowInitiatorPermissions | true (OOTB) = only initiator (or superuser) can terminate/suspend/resume |
| WorkflowSessionFactory | granite.workflow.inboxQuerySize | Max work items returned per Inbox query. OOTB 2000; raise if heavy users hit the cap |
| WorkflowSessionFactory | granite.workflow.maxPurgeSaveThreshold | OOTB 20 — commit after this many purged instances. Raise carefully to reduce JCR overhead during large purges |
| WorkflowSessionFactory | granite.workflow.maxPurgeQueryCount | OOTB 1000 — JCR query batch size during purge. Tune with above |
| Granite Workflow Queue (org.apache.sling.event.jobs.QueueConfiguration~workflow) | queue.maxparallel | Real parallelism knob. OOTB on AEM 6.5 LTS is 0.5 (50% of CPU cores). Adobe's Workflows Best Practices recommends between half and three-quarters of processor cores. cq.workflow.job.max.procs displayed in Felix Config Manager is an orphaned metatype label with no code path that reads it (verified against source on release/660 and prod/cq660) — do not rely on it. Verify the running value at /system/console/configMgr/org.apache.sling.event.jobs.QueueConfiguration~workflow |
| DefaultThreadPool (name=default) | block policy | OOTB RUN. ABORT would silently drop workflow timeout jobs — keep RUN unless you have a specific reason |
| DefaultThreadPool (name=default) | max pool size | OOTB on AEM 6.5 LTS is 5 (Apache Sling default). Bump to 20+ for environments with many custom periodic schedulers; otherwise the pool can starve under load |
| Purge Scheduler (com.adobe.granite.workflow.purge.Scheduler) | scheduledpurge.daysold | 30 default; tune per environment. Factory PID — deploy one config per schedule |
| Purge Scheduler | scheduledpurge.workflowStatus | Array-typed; e.g. ["COMPLETED"] |
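To make the queue.maxparallel guidance concrete, the half to three-quarters range can be worked out per host. This is only an arithmetic sketch: `recommended_parallel_range` is a hypothetical helper, and rounding down with a floor of 1 is an assumption of this example, not documented Sling behaviour.

```python
import math


def recommended_parallel_range(cores: int) -> tuple:
    """Return (low, high) worker counts for half to three-quarters of cores."""
    low = max(1, math.floor(cores * 0.5))      # assumption: round down, minimum 1
    high = max(low, math.floor(cores * 0.75))  # never below the lower bound
    return low, high
```

On an 8-core host this gives a range of 4 to 6 parallel workflow jobs. Compare it with the value actually running on the instance before changing anything.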
| Action | 6.5 LTS / AMS approach |
|--------|----------------------|
| Retry failed work item | JMX retryFailedWorkItems or Inbox Retry |
| Restart stale workflows | JMX restartStaleWorkflows(dryRun=true) then execute |
| Purge completed | JMX purgeCompleted(dryRun=true) or Purge Scheduler |
| Increase parallelism | Felix Console: queue.maxparallel on the Granite Workflow Queue (Apache Sling Job Queue Configuration); or OSGi config in repo |
| Fix thread pool exhaustion | Restart instance (immediate); fix stuck scheduler code; change block policy to RUN |
| Fix process not found | Deploy bundle; process.label must match; Sync model |
| Fix auto-advancement | Verify default pool not saturated; timeout jobs scheduled; block policy = RUN |
All workflow maintenance and diagnostic operations live on a single MBean: com.adobe.granite.workflow:type=Maintenance. A separate MBean — com.adobe.granite.workflow:type=Statistics — exposes time-series workflow execution metrics for trend analysis.
| MBean | Operations | Purpose |
|-------|------------|---------|
| com.adobe.granite.workflow:type=Maintenance | purgeCompleted(model [optional], days, dryRun), purgeActive(model [optional], days, dryRun), countRunningWorkflows(model [optional]), countCompletedWorkflows(model [optional]), countStaleWorkflows(model [optional]), restartStaleWorkflows(model [optional], dryRun), retryFailedWorkItems(dryRun, model [optional]), terminateFailedInstances(restart, dryRun, model [optional]), returnSystemJobInfo(), returnWorkflowQueueInfo(), returnWorkflowJobTopicInfo(), returnFailedWorkflowCount(model [optional]), returnFailedWorkflowCountPerModel(), listRunningWorkflowsPerModel(), listCompletedWorkflowsPerModel(), fetchModelList() | Purge, stale detection/restart, retry failed items, bulk terminate, queue/job diagnostics, per-model counts and enumeration |
| com.adobe.granite.workflow:type=Statistics | getResults, clearRecords; plus get/set accessors for DataLifeTime, DataFidelityTime, DataProcessRate, DataRate | Time-series workflow execution statistics |
Always use dryRun=true first before executing destructive purge or retry operations.
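The dry-run-first rule can be enforced in scripts with a small guard. The sketch below is deliberately transport-agnostic: `operation` is a hypothetical callable that you supply to invoke the MBean operation (for example through your own JMX client) with the given dryRun flag, and `approve` is a hypothetical callable that decides whether the preview is acceptable. Neither is an AEM API.

```python
from typing import Callable, Optional


def run_dry_run_first(
    operation: Callable[[bool], str],
    approve: Callable[[str], bool],
) -> Optional[str]:
    """Run operation(dry_run=True), and only execute for real if approved."""
    preview = operation(True)   # dryRun=true: report only, no changes
    if not approve(preview):
        return None             # stop here; nothing destructive has run
    return operation(False)     # dryRun=false: the real operation
```

With this shape the destructive call is unreachable unless a dry run has already produced output that someone, or some check, has accepted.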
Symptom: Workflow auto-advancement stops; timeout jobs not firing; workflows stuck at participant step despite timeout configured.
Root cause chain:
1. A scheduled job (e.g. AccessTokenScheduler) makes a blocking HTTP call without timeout
2. concurrent = true allows overlapping executions on each cron trigger
3. Each hung execution holds a default pool thread indefinitely
4. With block policy ABORT (OOTB is RUN), new Quartz triggers are rejected silently; on RUN they instead pile up on the caller thread and back-pressure the dispatch
5. The scheduler can no longer dispatch com/adobe/granite/workflow/timeout/job events

Diagnosis checklist:
- Sling Thread Pools (/system/console/status-slingthreadpools): Pool default → active count = max pool size?
- Pool default → block policy = ABORT?
- Threads (/system/console/status-Threads) or jstack: All sling-default-* threads stuck on same stack?
- Sling Jobs (/system/console/slingevent): Workflow job topic has high Failed Jobs?
- Sling Scheduler (/system/console/status-slingscheduler): ThreadPool = default for ApacheSlingdefault?

Fix: Restart instance (immediate); fix scheduler code (add HTTP timeout, set concurrent=false); change pool policy to RUN; increase pool size.
Symptom: numberOfFailedJobs >> numberOfFinishedJobs for a workflow topic.
Root cause: Process step exception, payload deleted, or process not registered.
Diagnosis: Search error.log for Error executing workflow step + model name. Check process.label in Felix Console → OSGi Components.
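The "numberOfFailedJobs >> numberOfFinishedJobs" symptom can be turned into a simple per-topic check. This sketch assumes the two counters were read from the topic statistics on the Sling Jobs page; the function names and the 0.5 threshold are illustrative assumptions, not values from AEM.

```python
def failure_ratio(failed: int, finished: int) -> float:
    """Share of failed jobs among all completed attempts for a topic."""
    total = failed + finished
    return failed / total if total else 0.0


def topic_is_failing(failed: int, finished: int, threshold: float = 0.5) -> bool:
    """Flag a topic when failures dominate (threshold is an assumed cut-off)."""
    return failure_ratio(failed, finished) > threshold
```

A topic with 90 failed and 10 finished jobs yields a ratio of 0.9 and is flagged, which is the case where the error.log search described above is worth running for that model.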
Symptom: Workflows in RUNNING state but no work items; Inbox empty despite running instances.
Root cause: Cannot archive workitem during transition; JCR session crash during step completion.
Diagnosis: Search for Cannot archive workitem; JMX countStaleWorkflows; restartStaleWorkflows(dryRun=true).