plugins/aem/6.5-lts/skills/aem-replication/troubleshoot-replication/SKILL.md
Diagnose and fix common AEM 6.5 LTS replication issues including blocked queues, connectivity failures, and content distribution problems
This skill provides systematic troubleshooting guidance for Adobe Experience Manager 6.5 LTS replication issues. Use this to diagnose and resolve problems with content distribution, agent configuration, and replication workflows.
Use this skill when experiencing:
Follow this systematic approach to identify and resolve replication issues:
1. Verify Symptoms
↓
2. Check Agent Status
↓
3. Review Replication Queue
↓
4. Test Connectivity
↓
5. Examine Logs
↓
6. Verify Configuration
↓
7. Apply Fix
↓
8. Validate Resolution
Symptoms:
Diagnosis:
Check agent status:
Navigate to: Tools → Deployment → Replication → Agents on author
Look for: Red indicator next to agent name
View queue details:
Click agent name
Review queue entries
Check error message on failed item
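The same queue state can be read over HTTP for scripted checks. This is a minimal sketch, not a definitive implementation: it assumes the agent's queue is served at `jcr:content.queue.json` under the agent path and that the JSON carries an `isBlocked` flag. Verify both on your instance. `AEM_AUTHOR` and `AEM_CREDS` are hypothetical variables.

```shell
# Hypothetical settings; adjust for your environment.
AEM_AUTHOR="${AEM_AUTHOR:-http://localhost:4502}"
AEM_CREDS="${AEM_CREDS:-user:password}"

# Fetch the queue JSON for one agent (agent name defaults to "publish").
# Assumption: the queue is served at <agent>/jcr:content.queue.json.
fetch_queue_json() (
  agent="${1:-publish}"
  curl -s -u "$AEM_CREDS" \
    "$AEM_AUTHOR/etc/replication/agents.author/$agent/jcr:content.queue.json"
)

# Read queue JSON on stdin; exit 0 if it reports a blocked queue.
# Assumption: the JSON contains "isBlocked": true or false.
queue_is_blocked() {
  grep -Eq '"isBlocked"[[:space:]]*:[[:space:]]*true'
}
```

Possible usage: `fetch_queue_json publish | queue_is_blocked && echo "publish queue is blocked"`. The grep-based match avoids a JSON parser dependency at the cost of robustness.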
Root Causes:
Solutions:
Solution A: Retry Failed Item
Steps:
1. Open blocked replication agent
2. Click "Force Retry" button
3. Monitor queue to see if item processes
4. If successful, remaining items will process automatically
Solution B: Clear Failed Item
Steps:
1. Open blocked replication agent
2. Select failed item in queue
3. Click "Clear" to remove it
4. Remaining items will process
5. Manually re-replicate cleared content if needed
Solution C: Restart Replication Components
Navigate to: /system/console/bundles
Search for: "replication"
Restart these bundles:
- com.day.cq.cq-replication
- com.day.cq.cq-replication-audit
- com.day.cq.wcm.cq-wcm-replication
Steps:
1. Find bundle
2. Click "Stop"
3. Wait for status: Resolved
4. Click "Start"
5. Verify status: Active
Solution D: Restart Event Processing
OSGi Console: /system/console/bundles
Restart: Apache Sling Event Support (org.apache.sling.event)
This clears event queue backlogs
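The stop/start steps in Solutions C and D can also be scripted against the Felix web console. A minimal sketch, assuming the console accepts `action=stop` and `action=start` form posts on `/system/console/bundles/<symbolic-name>`; `AEM_AUTHOR` and `AEM_CREDS` are hypothetical variables, and the function does not confirm that the bundle returns to Active.

```shell
# Hypothetical settings; adjust for your environment.
AEM_AUTHOR="${AEM_AUTHOR:-http://localhost:4502}"
AEM_CREDS="${AEM_CREDS:-user:password}"

# Stop, then start, one bundle identified by its symbolic name.
# The start request is only sent if the stop request succeeded.
restart_bundle() (
  bundle="$1"
  url="$AEM_AUTHOR/system/console/bundles/$bundle"
  curl -fsS -o /dev/null -u "$AEM_CREDS" -F action=stop "$url" &&
    curl -fsS -o /dev/null -u "$AEM_CREDS" -F action=start "$url"
)
```

Possible usage: `restart_bundle org.apache.sling.event`, followed by a manual check of the bundle status in `/system/console/bundles`.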
Symptoms:
Diagnosis:
Verify Publish instance is running:
# Check if Publish is accessible
curl -I http://publish-host:4503/system/console
# Or browse to:
http://publish-host:4503/system/console
Test network connectivity:
# From Author server
telnet publish-host 4503
# Or
nc -zv publish-host 4503
# Or
ping publish-host
Check replication agent URI:
Navigate to: Agent → Edit → Transport tab
Verify: URI matches Publish host and port
Expected: http://publish-host:4503/bin/receive?sling:authRequestLogin=1
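To test exactly the host and port the agent uses, the transport URI can be split before running the connectivity checks. The helper below is a hypothetical sketch (`uri_host_port` is not an AEM tool). It falls back to port 80 or 443 when the URI has no explicit port, and it does not handle user info or IPv6 literals.

```shell
# Print "<host> <port>" for an http(s) URI.
uri_host_port() (
  uri="$1"
  scheme="${uri%%://*}"        # text before "://"
  rest="${uri#*://}"           # text after "://"
  hostport="${rest%%/*}"       # drop the path
  hostport="${hostport%%\?*}"  # drop a query that follows the host directly
  host="${hostport%%:*}"
  if [ "$host" = "$hostport" ]; then
    # No explicit port: use the scheme default.
    case "$scheme" in
      https) port=443 ;;
      *)     port=80 ;;
    esac
  else
    port="${hostport##*:}"
  fi
  printf '%s %s\n' "$host" "$port"
)
```

Possible usage: `set -- $(uri_host_port "http://publish-host:4503/bin/receive?sling:authRequestLogin=1"); nc -zv "$1" "$2"`.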
Root Causes:
Solutions:
Solution A: Start Publish Instance
cd /path/to/publish/crx-quickstart
./bin/start
Solution B: Fix Network/Firewall
1. Verify firewall rules allow Author → Publish on port 4503
2. Check network ACLs and security groups (cloud environments)
3. Verify no proxy blocking connection
4. Test from Author server command line
Solution C: Correct Agent URI
Steps:
1. Edit replication agent
2. Transport tab
3. Update URI to correct host/port:
http://correct-publish-host:4503/bin/receive?sling:authRequestLogin=1
4. Save
5. Test Connection
Symptoms:
Diagnosis:
Check agent credentials:
Agent → Edit → Transport tab
Verify: User and Password fields
Verify user exists on Publish:
Publish instance: http://publish:4503/crx/explorer
Navigate to: /home/users
Search for: replication service user
Check user permissions:
On Publish instance:
User → Permissions
Required: Read, Write, Replicate privileges
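A quick command-line probe can separate credential problems from permission or connectivity problems. This is a sketch under assumptions: the URL to probe is yours to choose (a resource on Publish that the replication user should be able to read), and the explanations are rough guidance rather than AEM's own diagnostics. `probe_status` and `explain_status` are hypothetical helper names.

```shell
# Print the HTTP status for a request made with the given credentials.
# curl prints 000 when no HTTP response was received at all.
probe_status() (
  url="$1"
  creds="$2"
  curl -s -o /dev/null -w '%{http_code}' -u "$creds" "$url"
)

# Map a status code to the most likely area to investigate.
explain_status() {
  case "$1" in
    000) echo "no response: Publish down or network blocked" ;;
    401) echo "authentication failed: check user and password" ;;
    403) echo "forbidden: user lacks required permissions" ;;
    404) echo "not found: check the URI path" ;;
    2??) echo "credentials accepted" ;;
    *)   echo "unexpected status $1: check Publish error.log" ;;
  esac
}
```

Possible usage: `explain_status "$(probe_status "$PROBE_URL" "replication-service:secret")"`, where `PROBE_URL` and the credentials are placeholders.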
Root Causes:
Solutions:
Solution A: Update Credentials
Steps:
1. Edit replication agent
2. Transport tab
3. Enter correct username
4. Enter correct password
5. Save
6. Test Connection
Solution B: Create/Enable User on Publish
On Publish instance:
1. Navigate to: Security → Users
2. Create user: replication-service
3. Set password (match Agent configuration)
4. Save
Grant permissions:
1. Navigate to: Security → Permissions
2. Select user: replication-service
3. Add entries:
- Path: /content
- Privileges: jcr:read, crx:replicate, jcr:write
4. Save
Solution C: Reset Password
On Publish instance:
1. Navigate to: Security → Users
2. Find user in agent configuration
3. Click "Set Password"
4. Enter new password
5. Save
On Author:
1. Update replication agent with new password
2. Save
3. Test Connection
Symptoms:
Diagnosis:
Check agent URI protocol:
Agent → Transport tab
URI: https://... or http://...
Review error logs:
error.log contains:
- javax.net.ssl.SSLHandshakeException
- PKIX path building failed
- Certificate validation failed
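The three messages above can be counted in one pass to see which failure dominates. A minimal sketch, assuming the log lives at `crx-quickstart/logs/error.log` relative to the current directory unless a path is passed; `ssl_error_summary` is a hypothetical helper name.

```shell
# Print "<count><TAB><message>" for each SSL-related message in a log file.
# Counts are matching lines, so one line can count toward several messages.
ssl_error_summary() (
  logfile="${1:-crx-quickstart/logs/error.log}"
  for pattern in \
    "javax.net.ssl.SSLHandshakeException" \
    "PKIX path building failed" \
    "Certificate validation failed"
  do
    count="$(grep -cF -- "$pattern" "$logfile")"
    printf '%s\t%s\n' "$count" "$pattern"
  done
)
```

A high count for "PKIX path building failed" usually points at a missing trusted certificate, which is the case Solution B addresses.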
Solutions:
Solution A: Enable Relaxed SSL (Development Only)
WARNING: Only for development/testing environments
Steps:
1. Edit replication agent
2. Transport tab
3. SSL section:
✓ Relaxed SSL (allow self-signed certificates)
✓ Allow expired (allow expired certificates)
4. Save
5. Test Connection
Solution B: Import Certificates (Production)
On Author instance:
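1. Export certificate from Publish (use the HTTPS host and port from the agent URI):
openssl s_client -connect publish:4503 -showcerts </dev/null 2>/dev/null \
| openssl x509 -outform PEM > publish-cert.pem
2. Import into the truststore of the JVM that runs AEM
(Java 9 and later: $JAVA_HOME/lib/security; Java 8: $JAVA_HOME/jre/lib/security):
cd $JAVA_HOME/lib/security
keytool -importcert -alias publish-aem -file publish-cert.pem \
-keystore cacerts -storepass changeit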
3. Restart AEM Author
4. Test replication agent connection
Solution C: Use HTTP (Not Recommended for Production)
If SSL is not required:
1. Edit agent
2. Transport tab
3. Change URI from https:// to http://
4. Save
5. Test Connection
Symptoms:
Diagnosis:
Check content directly on Publish:
Bypass Dispatcher:
http://publish:4503/content/mysite/en/page.html
If content appears here but not via Dispatcher:
→ Dispatcher cache issue
If content doesn't appear:
→ Replication issue
Verify replication status:
On Author:
Page → Properties → Basic tab
Check: Last Published timestamp
Verify: Status shows "Published"
Check Publish logs:
Publish instance: crx-quickstart/logs/error.log
Search for: path of page
Look for: Errors during content import
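The bypass test above reduces to a small decision rule. The sketch below is a hypothetical helper, not an AEM tool: it treats a non-200 response from Publish as a replication issue and differing bodies as a Dispatcher cache issue. Pages with dynamic or rewritten markup can differ for legitimate reasons, so treat the result as a hint.

```shell
# Arguments: HTTP status from Publish, body file from Publish,
# body file fetched through Dispatcher.
classify_visibility() {
  if [ "$1" != "200" ]; then
    echo "replication issue"
  elif cmp -s "$2" "$3"; then
    echo "consistent"
  else
    echo "dispatcher cache issue"
  fi
}

# Example of gathering the inputs (hosts and paths are placeholders):
# status=$(curl -s -o /tmp/publish.html -w '%{http_code}' \
#   http://publish:4503/content/mysite/en/page.html)
# curl -s -o /tmp/dispatcher.html http://dispatcher/content/mysite/en/page.html
# classify_visibility "$status" /tmp/publish.html /tmp/dispatcher.html
```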
Root Causes:
Solutions:
Solution A: Manual Dispatcher Cache Clear
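# On Dispatcher server: clear the whole cache
# (the && guard keeps rm from running in the wrong directory if cd fails)
cd /path/to/dispatcher/cache && rm -rf ./*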
# Or specific path
rm -rf /path/to/dispatcher/cache/content/mysite/en/*
# Check Dispatcher logs
tail -f /path/to/dispatcher/logs/dispatcher.log
Solution B: Verify Dispatcher Flush Agent
On Publish instance:
1. Navigate to: Tools → Deployment → Replication
2. Select: Agents on publish
3. Click: Dispatcher Flush
4. Verify: Enabled = ✓
5. Transport tab:
URI: http://dispatcher:80/dispatcher/invalidate.cache
6. Test Connection
7. If failed, fix connectivity
Solution C: Check Content Permissions on Publish
On Publish instance:
1. Navigate to: CRXDE Lite
2. Browse to: /content/mysite/en/page
3. Check node exists
4. Verify permissions: anonymous user can read
5. If not, adjust permissions
Solution D: Force Republish
On Author:
1. Select page(s)
2. Manage Publication
3. Action: Unpublish
4. Execute
5. Wait for completion
6. Manage Publication
7. Action: Publish
8. Execute
9. Verify on Publish
Symptoms:
Diagnosis:
Check Dispatcher Flush agent:
Publish instance: /etc/replication/agents.publish/flush
Status: Should be green (idle/active)
Review Dispatcher configuration:
dispatcher.any file (both sections belong inside /cache):
/cache {
  /allowedClients {
    /0 { /type "allow" /glob "*publish-ip*" }
  }
  /invalidate {
    /0000 { /glob "*" /type "allow" }
  }
}
Check Dispatcher logs:
tail -f /var/log/httpd/dispatcher.log
Look for invalidation requests:
[date] [I] [pid] Received invalidate request
Solutions:
Solution A: Enable Dispatcher Flush Agent
On Publish instance:
1. Navigate to: /etc/replication/agents.publish/flush
2. Edit agent
3. Settings tab: ✓ Enabled
4. Serialization Type: Dispatcher Flush
5. Save
Solution B: Fix Dispatcher Configuration
Edit dispatcher.any (both sections belong inside /cache):
/cache {
  /allowedClients {
    /0 {
      /type "allow"
      /glob "*<publish-instance-ip>*"
    }
  }
  /invalidate {
    /0000 { /glob "*" /type "allow" }
  }
}
Reload Dispatcher:
apachectl graceful
Solution C: Verify Flush Agent Transport
Dispatcher Flush agent → Transport tab
Correct URI format:
http://dispatcher-host:80/dispatcher/invalidate.cache
OR if virtual host:
http://www.example.com/dispatcher/invalidate.cache
Test Connection
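An invalidation can also be sent by hand to confirm that Dispatcher accepts it. A minimal sketch, assuming the standard `CQ-Action` and `CQ-Handle` request headers and that the request originates from a host listed in /allowedClients; `invalidate_path` is a hypothetical helper name.

```shell
# Send an invalidation request for one content path and print the HTTP status.
invalidate_path() (
  dispatcher="$1"   # e.g. http://dispatcher-host:80
  handle="$2"       # e.g. /content/mysite/en/page
  curl -s -o /dev/null -w '%{http_code}\n' -X POST \
    -H "CQ-Action: Activate" \
    -H "CQ-Handle: $handle" \
    -H "Content-Length: 0" \
    -H "Content-Type: application/octet-stream" \
    "$dispatcher/dispatcher/invalidate.cache"
)
```

Possible usage: `invalidate_path http://dispatcher-host:80 /content/mysite/en/page`, then look for the matching invalidation entry in dispatcher.log.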
Symptoms:
Diagnosis:
Check agent timeouts:
Agent → Edit → Extended tab
Connection Timeout: default 10000ms
Socket Timeout: default 10000ms
Review package size:
Large packages (>100MB) may timeout
Check: crx-quickstart/logs/replication.log
Solutions:
Solution A: Increase Timeouts
Agent → Edit → Extended tab
Connection Timeout: 30000 (30 seconds)
Socket Timeout: 60000 (60 seconds)
For very large packages: 120000 (2 minutes)
Solution B: Use Asynchronous Replication
For large content:
1. Use default async replication (not synchronous)
2. Monitor queue instead of waiting
3. Package-based replication for very large sets
Solution C: Split Large Packages
Instead of tree activation:
1. Activate in smaller batches
2. Use incremental replication
3. Schedule large activations during off-peak hours
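Activating in smaller batches can be scripted by posting one path at a time to the replication servlet with a pause in between. This is a sketch under assumptions: it uses the `/bin/replicate.json` endpoint with `cmd=activate`, and `AEM_AUTHOR`, `AEM_CREDS`, and the default pause of 5 seconds are hypothetical values to tune.

```shell
# Hypothetical settings; adjust for your environment.
AEM_AUTHOR="${AEM_AUTHOR:-http://localhost:4502}"
AEM_CREDS="${AEM_CREDS:-user:password}"

# Read one content path per line on stdin and activate each in turn,
# pausing between requests so the queue can drain. Blank lines are skipped.
activate_paths() (
  pause="${1:-5}"
  while IFS= read -r page; do
    [ -n "$page" ] || continue
    curl -s -o /dev/null -u "$AEM_CREDS" \
      -F cmd=activate -F "path=$page" \
      "$AEM_AUTHOR/bin/replicate.json" </dev/null ||
      echo "failed: $page" >&2
    sleep "$pause"
  done
)
```

Possible usage: `activate_paths 10 < paths.txt`, where `paths.txt` is a hypothetical file listing the pages of one batch.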
Symptoms:
Diagnosis:
Check enabled agents:
Navigate to: /etc/replication/agents.author
Verify: At least one agent is enabled
Check: Green status indicator
Review agent triggers:
Agent → Edit → Triggers tab
Check: "Ignore default" is NOT checked
Verify: Appropriate triggers enabled
Root Causes:
Solutions:
Solution A: Enable Default Agent
Steps:
1. Navigate to: /etc/replication/agents.author/publish
2. Edit agent
3. Settings tab: ✓ Enabled
4. Triggers tab: Uncheck "Ignore default"
5. Save
Solution B: Check Agent Filters (Programmatic)
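// If using ReplicationOptions in code
import com.day.cq.replication.Agent;
import com.day.cq.replication.AgentFilter;
import com.day.cq.replication.ReplicationOptions;

ReplicationOptions opts = new ReplicationOptions();
// Ensure the filter doesn't exclude all agents
opts.setFilter(new AgentFilter() {
    @Override
    public boolean isIncluded(Agent agent) {
        // Must return true for at least one enabled agent;
        // here any agent whose ID does not contain "invalid" is included
        return !agent.getId().contains("invalid");
    }
});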
Solution C: Verify Agent Configuration
For each agent:
1. Enabled: ✓
2. Transport URI: Valid and reachable
3. Test Connection: Success
4. Triggers: At least one enabled
5. Ignore default: Unchecked (unless custom workflow)
Symptoms:
Diagnosis:
From official AEM 6.5 LTS documentation: Check for blocking nodes in repository.
Navigate to: CRXDE Lite (/crx/de/index.jsp)
Search for: /bin/replicate or /bin/replicate.json
These nodes may block the replication servlet
Root Cause:
Custom nodes created at /bin/replicate or /bin/replicate.json can override the default replication servlet, preventing normal replication operations.
Solution:
Steps:
1. Navigate to CRXDE Lite: http://localhost:4502/crx/de/index.jsp
2. Check path: /bin/replicate
3. If node exists and is not the system servlet:
- Right-click node
- Select "Delete"
- Save All
4. Repeat for: /bin/replicate.json
5. Test replication
Verification:
After deletion:
1. Activate a test page
2. Check replication queue processes
3. Verify content appears on Publish
Symptoms:
Diagnosis:
From official AEM 6.5 LTS documentation: Replication user lacks namespace management privileges.
Navigate to: CRXDE Lite
Path: Repository level (/)
Check: Replication user privileges
Root Cause:
The replication user (configured in agent's "Agent User Id") doesn't have jcr:namespaceManagement privilege, which is required to replicate custom namespaces.
Solution:
Steps:
1. Navigate to CRXDE Lite
2. Select repository root: /
3. Click "Access Control" tab
4. Find replication service user
5. Add privilege:
- Privilege: jcr:namespaceManagement
- Apply
6. Save All
Alternatively, grant via the Permissions console:
1. Tools → Security → Permissions
2. Search for: replication-service user
3. Repository level permissions:
✓ jcr:read
✓ jcr:write
✓ crx:replicate
✓ jcr:namespaceManagement ← Add this
4. Save
Symptoms:
/var/replication/data has many items
Diagnosis:
From official AEM 6.5 LTS documentation: Check for corrupted replication jobs.
Check event queue:
Navigate to: CRXDE Lite
XPath Query:
/jcr:root/var/eventing/jobs//element(*,slingevent:Job)
This shows all pending Sling event jobs
Check replication data:
Path: /var/replication/data
Look for: Large number of nodes
Root Cause: Repository corruption or serialization errors can cause replication jobs to get stuck in the Sling event queue.
Solution A: Clean Event Jobs
Via CRXDE Lite:
1. Run XPath query:
/jcr:root/var/eventing/jobs//element(*,slingevent:Job)
2. Review results for stuck jobs
3. Identify jobs with:
- Old timestamps
- Error properties
- Replication-related topic
4. Carefully delete stuck jobs
5. Save All
Solution B: Clear Replication Data
WARNING: Only if queue is irreparably stuck
1. Stop AEM instance
2. Back up the repository directory: crx-quickstart/repository/
3. Start AEM
4. In CRXDE Lite, delete corrupted items under /var/replication/data
5. Save All
6. Verify replication resumes
Solution C: Enable Detailed Logging
From official documentation - configure detailed replication logging:
Navigate to: /system/console/configMgr
Search for: Apache Sling Logging Logger Configuration
Create new configuration:
- Logger: com.day.cq.replication
- Log Level: DEBUG
- Log File: logs/replication.log
Save and review logs for root cause
Symptoms:
Diagnosis:
From official AEM 6.5 LTS documentation: Queue pause has known limitations.
Known Limitations:
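- Not persisted across restarts: a paused queue resumes when AEM restarts
- Auto-resume timeout: a paused queue resumes after being idle for more than 1 hour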
Workaround:
Instead of pausing, disable the agent:
Agent configuration:
1. Edit agent
2. Settings tab
3. Uncheck "Enabled"
4. Save
This persists across restarts
For temporary pause:
Accept the limitations:
- Must re-pause after restart
- Must re-pause if idle >1 hour
- Use agent monitoring to track state
Location: crx-quickstart/logs/replication.log
Key patterns to search:
# Successful replication
grep "Replication (ACTIVATE) of /content/mysite" replication.log
# Failed replication
grep "ERROR" replication.log | grep replication
# Agent not found
grep "no agent found" replication.log
# Authentication failures
grep "401" replication.log
# Connection issues
grep "Connection refused" replication.log
Example log analysis:
# Find all replication attempts for a path
grep "/content/mysite/en/page" replication.log
# Count failures by type
grep "ERROR" replication.log | cut -d' ' -f5- | sort | uniq -c | sort -rn
# Recent replication activity
tail -100 replication.log | grep "ACTIVATE\|DEACTIVATE"
Navigate to: /system/console/jmx
Search for: com.day.cq.replication
Monitor MBeans:
- Replication Agent Stats
- Queue Size
- Number of queued items
- Last processed item
- Error count
- Replication Service
- Active replications
- Failed replications
- Average processing time
Navigate to: /system/console/configMgr
Relevant configurations:
- Day CQ Replication Service
- Day CQ WCM Replication Impl ReplicationComponentFactoryImpl
- Apache Sling Job Consumer Manager
Verify:
- Services are active
- No configuration errors
- Thread pools not exhausted
Navigate to: /system/console/slingevent
Check:
- Event queue depth
- Stuck events
- Processing rate
- Failed events
If queue stuck:
- Restart org.apache.sling.event bundle
- Check disk space
- Review thread dumps
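The disk space check can be automated with `df`. A minimal sketch: the 90 percent default threshold is an arbitrary assumption, and `disk_used_pct` and `disk_nearly_full` are hypothetical helper names.

```shell
# Print the used-space percentage (without the % sign) of the
# filesystem that holds the given path.
disk_used_pct() {
  df -P "$1" | awk 'NR==2 { sub(/%/, "", $5); print $5 }'
}

# Exit 0 if usage is at or above the threshold (default 90).
disk_nearly_full() {
  [ "$(disk_used_pct "$1")" -ge "${2:-90}" ]
}
```

Possible usage: `disk_nearly_full /path/to/crx-quickstart && echo "low disk space"`.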
Schedule periodic tests:
Weekly:
1. Test Connection for all agents
2. Verify queues are empty
3. Review error logs
4. Check disk space on Publish
Set up monitoring for:
Regular maintenance:
Monthly:
- Review and clear old logs
- Verify agent credentials
- Test disaster recovery procedures
- Update documentation
Quarterly:
- Certificate renewal checks
- Performance testing
- Capacity planning review
Use this checklist for systematic troubleshooting:
□ Verify symptom and impact
□ Check replication agent status (green/red)
□ Review replication queue for stuck items
□ Test agent connectivity
□ Verify Publish instance is running
□ Check authentication credentials
□ Review error.log and replication.log
□ Verify agent configuration (URI, credentials, settings)
□ Check network connectivity (ping, telnet, curl)
□ Test direct Publish access (bypass Dispatcher)
□ Verify Dispatcher Flush agent (if applicable)
□ Check content permissions on Publish
□ Review OSGi bundles status
□ Examine Sling event queue
□ Check disk space on Author and Publish
□ Verify JVM heap usage
□ Test with simple content first
□ Document findings and resolution
If issue persists after troubleshooting:
Gather diagnostic information:
Check Adobe Experience League Community:
Adobe Support (if entitled):
- configure-replication-agent: Set up and configure agents properly
- replicate-content: Understand replication methods
- replication-api: Programmatic replication for custom code