plugins/infrastructure/ansible-workflows/skills/ansible-error-handling/SKILL.md
Implements robust error handling in Ansible using block/rescue/always patterns, retry logic with until/retries, and clear assertion patterns for graceful failure management.
Patterns for robust error handling in Ansible playbooks and roles.
Handle errors and perform cleanup:
- name: Deploy application
  block:
    - name: Stop application
      ansible.builtin.systemd:
        name: myapp
        state: stopped

    - name: Deploy new version
      ansible.builtin.copy:
        src: myapp-v2.0
        dest: /usr/bin/myapp

    - name: Start application
      ansible.builtin.systemd:
        name: myapp
        state: started

  rescue:
    - name: Rollback to previous version
      ansible.builtin.copy:
        src: myapp-backup
        dest: /usr/bin/myapp

    - name: Start application (rollback)
      ansible.builtin.systemd:
        name: myapp
        state: started

    - name: Report failure
      ansible.builtin.fail:
        msg: "Deployment failed, rolled back to previous version"

  always:
    # The file module does not expand wildcards, so list the matches first
    - name: Find temp files
      ansible.builtin.find:
        paths: /tmp
        patterns: "deploy-*"
        file_type: any
      register: deploy_temp_files

    - name: Cleanup temp files
      ansible.builtin.file:
        path: "{{ item.path }}"
        state: absent
      loop: "{{ deploy_temp_files.files }}"
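
Inside a `rescue` section, Ansible sets the variables `ansible_failed_task` and `ansible_failed_result`, which can make the failure report more specific than a fixed message. The following is a minimal sketch, not part of the original skill; the `/usr/local/bin/deploy-myapp` command is a hypothetical placeholder for any step that can fail.

```yaml
- name: Deploy with a detailed failure report
  block:
    - name: Run deployment step
      # Hypothetical command, present only so that something can fail
      ansible.builtin.command: /usr/local/bin/deploy-myapp
      changed_when: true

  rescue:
    - name: Show which task failed and why
      ansible.builtin.debug:
        msg: >-
          Task '{{ ansible_failed_task.name }}' failed:
          {{ ansible_failed_result.msg | default('no message returned') }}

    # Without this task the rescue would mark the host as recovered
    - name: Re-raise so the play still reports a failure
      ansible.builtin.fail:
        msg: "Deployment failed in '{{ ansible_failed_task.name }}'"
```

Ending the `rescue` section with `ansible.builtin.fail` keeps the host in the failed state after the rescue work is done, which matches the rollback example above.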
Handle transient failures with retries:
- name: Wait for service to be ready
  ansible.builtin.uri:
    url: http://localhost:8080/health
    status_code: 200
  register: health_check
  until: health_check.status == 200
  retries: 30
  delay: 10
  # Total wait: up to 5 minutes (30 * 10s)

- name: Wait for cluster to stabilize
  ansible.builtin.command: pvecm status
  register: cluster_status
  # pvecm typically pads the value with several spaces, so match with a
  # regex instead of the fixed string 'Quorate: Yes'
  until: "cluster_status.stdout is search('Quorate:\\s+Yes')"
  retries: 12
  delay: 5
  changed_when: false
| Parameter | Description |
|-----------|-------------|
| until | Condition that must be true to stop retrying |
| retries | Maximum number of retries after the first attempt (default: 3) |
| delay | Seconds to wait between attempts (default: 5) |
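
The `until` condition is not limited to the HTTP status code; it can inspect any field of the registered result. The sketch below assumes a hypothetical health endpoint that returns a JSON body with a `status` field set to `ready`; that response shape is an illustration, not something the original examples define.

```yaml
- name: Wait until the health endpoint reports ready
  ansible.builtin.uri:
    url: http://localhost:8080/health
  register: health_check
  # The default() filters cover attempts where the request failed
  # and no JSON body was returned
  until: (health_check.json | default({})).status | default('') == 'ready'
  retries: 12   # up to 12 retries after the first attempt
  delay: 5      # 5 seconds between attempts
```

With these values the task gives up after about one minute of waiting (12 delays of 5 seconds, plus the time each request takes).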
Validate inputs with clear error messages:
- name: Validate required variables
  ansible.builtin.assert:
    that:
      - vm_name is defined
      - vm_name | length > 0
      - vm_memory >= 1024
      - vm_cores >= 1
    fail_msg: |
      Invalid VM configuration:
      - vm_name: {{ vm_name | default('NOT SET') }}
      - vm_memory: {{ vm_memory | default('NOT SET') }} (min: 1024)
      - vm_cores: {{ vm_cores | default('NOT SET') }} (min: 1)
    success_msg: "VM configuration validated"
    quiet: true

# Common conditions for the `that:` list

# Variable defined and non-empty
- vm_name is defined and vm_name | trim | length > 0
# Numeric range
- vm_memory >= 1024 and vm_memory <= 65536
# Regex match
- vm_name is match('^[a-z0-9-]+$')
# List has items
- vm_networks | length > 0
# Value in allowed list
- vm_ostype in ['l26', 'win10', 'win11']
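
Numeric comparisons can misbehave when a value arrives as a string, for example when it is passed as `-e vm_memory=2048` on the command line. A minimal sketch of a more defensive variant, reusing the `vm_memory` and `vm_cores` names from the example above; the wording of the message is an illustration only:

```yaml
- name: Validate VM sizing
  ansible.builtin.assert:
    that:
      # Check for presence first so a missing variable fails the
      # assertion instead of raising an undefined-variable error
      - vm_memory is defined
      - vm_memory | int >= 1024
      - vm_cores is defined
      - vm_cores | int >= 1
    fail_msg: "vm_memory (min: 1024) and vm_cores (min: 1) must both be set"
    quiet: true
```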
Provide actionable error messages:
- name: Check prerequisites
  ansible.builtin.command: which docker
  register: docker_check
  changed_when: false
  failed_when: false

- name: Fail if Docker not installed
  ansible.builtin.fail:
    msg: |
      Docker is not installed on {{ inventory_hostname }}.

      To install Docker:
        sudo apt update
        sudo apt install docker.io

      Or use the docker role:
        ansible-playbook playbooks/install-docker.yml
  when: docker_check.rc != 0
Allow expected "failures":
- name: Try to stop service
  ansible.builtin.systemd:
    name: myservice
    state: stopped
  register: stop_result
  # List items are combined with AND.
  # Only fail if the error is NOT "service not found". The systemd module
  # usually words this as "Could not find the requested service ...";
  # check the message on your Ansible version.
  failed_when:
    - stop_result.failed
    - "'Could not find' not in (stop_result.msg | default(''))"

- name: Join cluster
  ansible.builtin.command: pvecm add {{ primary_node }}
  register: cluster_join
  failed_when:
    - cluster_join.rc != 0
    - "'already in a cluster' not in cluster_join.stderr"
    - "'cannot join' not in cluster_join.stderr"
  changed_when: cluster_join.rc == 0
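
A list under `failed_when` is combined with AND, as in the examples above. When either of two conditions should cause a failure, a single expression with `or` can be used instead. A minimal sketch, assuming a hypothetical `/usr/local/bin/migrate-db` command that prints `ERROR` or `applied` in its output:

```yaml
- name: Run database migration
  # Hypothetical command and output markers, used for illustration only
  ansible.builtin.command: /usr/local/bin/migrate-db
  register: migrate
  changed_when: "'applied' in migrate.stdout"
  # Fail on a non-zero exit code, or when the output reports an error
  # even though the exit code was 0
  failed_when: migrate.rc != 0 or 'ERROR' in migrate.stdout
```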
Separate checking from failing for better control:
- name: Check if resource exists
  ansible.builtin.command: check-resource {{ resource_id }}
  register: resource_check
  changed_when: false
  failed_when: false  # Don't fail here

- name: Fail with context if missing
  ansible.builtin.fail:
    msg: |
      Resource {{ resource_id }} not found.
      Command output: {{ resource_check.stderr }}
      Hint: Ensure resource was created first.
  when: resource_check.rc != 0
Attempt operation, handle specific errors:
- name: Attempt primary approach
  block:
    - name: Connect via primary endpoint
      ansible.builtin.uri:
        url: "https://{{ primary_host }}:8006/api2/json"
        validate_certs: true
      register: primary_result

  rescue:
    - name: Log primary failure
      ansible.builtin.debug:
        msg: "Primary endpoint failed: {{ primary_result.msg | default('unknown error') }}"

    - name: Try fallback endpoint
      ansible.builtin.uri:
        url: "https://{{ fallback_host }}:8006/api2/json"
        validate_certs: false
      register: fallback_result
Run checks from controller for better error context:
- name: Verify API endpoint from controller
  ansible.builtin.uri:
    url: "https://{{ inventory_hostname }}:8006/api2/json/version"
    validate_certs: false
  delegate_to: localhost
  register: api_check
  failed_when: false

- name: Report API status
  ansible.builtin.fail:
    msg: |
      Cannot reach Proxmox API on {{ inventory_hostname }}
      Status: {{ api_check.status | default('connection failed') }}
      Check: Network connectivity, firewall rules, pveproxy service
  when: api_check.status | default(0) != 200
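
The same delegation idea can separate a network problem from an API problem by testing the TCP port before making an HTTP request. This is a sketch under assumptions: it reuses port 8006 from the example above, and the 10-second timeout and the message wording are illustrative choices.

```yaml
- name: Verify API port from controller
  block:
    - name: Wait for TCP port 8006 to accept connections
      ansible.builtin.wait_for:
        host: "{{ ansible_host | default(inventory_hostname) }}"
        port: 8006
        timeout: 10
      delegate_to: localhost

  rescue:
    - name: Fail with a network-specific hint
      ansible.builtin.fail:
        msg: |
          TCP port 8006 on {{ inventory_hostname }} is not reachable from the controller.
          Check: Network connectivity, firewall rules, pveproxy service
```

If the port check passes but the HTTP check still fails, the cause is more likely the service or its certificate than the network path.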

# ignore_errors: the result is recorded, but every failure is ignored
- name: Remove optional backup
  ansible.builtin.file:
    path: /backup/old-backup.tar.gz
    state: absent
  ignore_errors: true
  register: cleanup_result

- name: Report cleanup status
  ansible.builtin.debug:
    msg: "Cleanup {{ 'successful' if not cleanup_result.failed else 'skipped' }}"

# BETTER than ignore_errors
- name: Remove backup
  ansible.builtin.file:
    path: /backup/old-backup.tar.gz
    state: absent
  register: cleanup_result
  failed_when:
    - cleanup_result.failed
    - "'does not exist' not in cleanup_result.msg | default('')"
---
- name: Deploy with comprehensive error handling
  hosts: app_servers
  become: true

  tasks:
    - name: Validate configuration
      ansible.builtin.assert:
        that:
          - app_version is defined
          - app_version is match('^\d+\.\d+\.\d+$')
        fail_msg: "Invalid app_version: {{ app_version | default('NOT SET') }}"

    - name: Deploy application
      block:
        - name: Download release
          ansible.builtin.get_url:
            url: "https://releases.example.com/{{ app_version }}.tar.gz"
            dest: /tmp/app.tar.gz
          register: download
          until: download is succeeded
          retries: 3
          delay: 5

        - name: Stop current version
          ansible.builtin.systemd:
            name: myapp
            state: stopped

        - name: Extract release
          ansible.builtin.unarchive:
            src: /tmp/app.tar.gz
            dest: /opt/myapp
            remote_src: true

        - name: Start new version
          ansible.builtin.systemd:
            name: myapp
            state: started

        - name: Verify health
          ansible.builtin.uri:
            url: http://localhost:8080/health
          register: health
          until: health.status == 200
          retries: 6
          delay: 10

      rescue:
        - name: Restore previous version
          ansible.builtin.copy:
            src: /opt/myapp-backup/
            dest: /opt/myapp/
            remote_src: true

        - name: Start previous version
          ansible.builtin.systemd:
            name: myapp
            state: started

        - name: Report deployment failure
          ansible.builtin.fail:
            msg: |
              Deployment of {{ app_version }} failed.
              Previous version restored.
              Check logs: journalctl -u myapp

      always:
        - name: Cleanup download
          ansible.builtin.file:
            path: /tmp/app.tar.gz
            state: absent
For detailed error handling patterns and techniques, consult:
references/error-handling.md - Comprehensive error handling patterns, block/rescue/always examples, retry strategies