skills/ai-safety-guardrails/SKILL.md
Use when adding safety layers to AI features - output validation, hallucination detection, content filtering, PII redaction, input sanitization
npx skillsauth add kienbui1995/magic-powers ai-safety-guardrails

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
LLMs will confidently produce harmful, incorrect, or leaked content if you don't add guardrails. Every AI feature needs input validation, output validation, and fallback behavior.

```
User Input → Input Guard → LLM → Output Guard → User
                  ↓                   ↓
           Block/sanitize      Validate/filter
```

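
The flow above can be sketched as a small wrapper. This is a minimal illustration, not the skill's own implementation: `run_guarded`, `BLOCKED_MESSAGE`, and the convention that a guard returns `None` to block are all assumptions made for the example.

```python
from typing import Callable, Optional

# Hypothetical message shown when either guard blocks the request.
BLOCKED_MESSAGE = "Sorry, I can't help with that request."


def run_guarded(
    user_input: str,
    input_guard: Callable[[str], Optional[str]],
    llm: Callable[[str], str],
    output_guard: Callable[[str], Optional[str]],
) -> str:
    """Pass input through the input guard, the model, then the output guard.

    Each guard returns the (possibly sanitized) text, or None to block.
    """
    sanitized = input_guard(user_input)
    if sanitized is None:
        return BLOCKED_MESSAGE  # blocked before reaching the model
    raw_output = llm(sanitized)
    validated = output_guard(raw_output)
    if validated is None:
        return BLOCKED_MESSAGE  # never return unvalidated model output
    return validated
```

Passing the model and both guards in as callables keeps each stage testable on its own with stubs.
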
Input guards:

| Guard | What | Implementation |
|-------|------|----------------|
| Prompt injection detection | Block "ignore instructions" attacks | Classifier or regex filter |
| Input length limit | Prevent context stuffing | Max token count |
| PII detection | Redact before sending to LLM | Regex + NER model |
| Topic filtering | Block off-topic requests | Classifier |


Output guards:

| Guard | What | Implementation |
|-------|------|----------------|
| Hallucination check | Verify claims against source | Cross-reference with retrieved docs |
| PII leak detection | Catch leaked personal data | Regex scan on output |
| Format validation | Ensure JSON/structured output | Schema validation |
| Toxicity filter | Block harmful content | Classifier (Perspective API, etc.) |
| Confidence threshold | Reject low-confidence answers | "I don't know" fallback |

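
A minimal sketch of two of these output checks (format validation and a PII regex scan), using only the standard library. The `GuardResult` type, the failure labels, and the two regex patterns are illustrative assumptions; real systems would likely use a schema library and broader PII detection.

```python
import json
import re
from dataclasses import dataclass, field

# Illustrative patterns only; real PII detection needs much broader coverage.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
US_SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


@dataclass
class GuardResult:
    passed: bool
    failures: list[str] = field(default_factory=list)


def check_output(raw_output: str, required_keys: set[str]) -> GuardResult:
    """Run format validation and a PII regex scan on raw LLM output."""
    failures = []

    # Format validation: output must be a JSON object with the expected keys.
    try:
        parsed = json.loads(raw_output)
    except json.JSONDecodeError:
        failures.append("invalid_json")
    else:
        if not isinstance(parsed, dict):
            failures.append("not_an_object")
        elif not required_keys.issubset(parsed.keys()):
            failures.append("missing_keys")

    # PII leak detection: scan the raw text regardless of whether it parsed.
    if EMAIL_RE.search(raw_output):
        failures.append("email_leak")
    if US_SSN_RE.search(raw_output):
        failures.append("ssn_leak")

    return GuardResult(passed=not failures, failures=failures)
```

Collecting every failure, rather than stopping at the first, gives the review log a fuller picture of what went wrong.
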

```
IF output fails any guard:
  → Don't show raw LLM output
  → Return safe fallback: "I'm not sure about that. Let me connect you with support."
  → Log the failure for review
```

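
The fallback rule could look roughly like this in code. The `Guard` signature (a callable returning a failure reason or `None`) and the logger name are assumptions for the sketch; only the fallback message comes from the text above.

```python
import logging
from typing import Callable, Optional

logger = logging.getLogger("guardrails")  # hypothetical logger name

SAFE_FALLBACK = "I'm not sure about that. Let me connect you with support."

# A guard takes the raw output and returns a failure reason, or None if it passes.
Guard = Callable[[str], Optional[str]]


def guarded_response(raw_output: str, guards: list[Guard]) -> str:
    """Return the LLM output only if every guard passes; otherwise the safe fallback."""
    failures = []
    for guard in guards:
        reason = guard(raw_output)
        if reason is not None:
            failures.append(reason)
    if failures:
        # Log for later review, but never surface the raw output to the user.
        logger.warning("output guard failure: %s", failures)
        return SAFE_FALLBACK
    return raw_output


def no_empty_output(text: str) -> Optional[str]:
    """Example guard: reject blank responses."""
    return "empty_output" if not text.strip() else None
```
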
Modern jailbreaks are sophisticated — detect by pattern + behavior, not just keywords:
Common jailbreak patterns to detect:
Defense approach:
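
```python
import re
from dataclasses import dataclass

@dataclass
class JailbreakResult:
    detected: bool
    signals: list[str]
    action: str  # "allow", "flag_for_review", or "block"

def detect_jailbreak_attempt(user_input: str) -> JailbreakResult:
    # count_tokens and classifier are placeholders for your tokenizer and injection classifier
    signals = []
    # Pattern-based (fast, cheap)
    if re.search(r"ignore (previous|all|your) (instructions|rules)", user_input, re.I):
        signals.append("instruction_override")
    # \b keeps "DAN" from matching inside words such as "dancer"
    if re.search(r"(pretend|act as|roleplay|you are now) .*\b(no restrictions|DAN|uncensored)\b", user_input, re.I):
        signals.append("persona_escape")
    # Semantic check (moderate cost)
    if count_tokens(user_input) > 2000:  # long inputs may hide injection
        injection_score = classifier.score(user_input, "prompt_injection")
        if injection_score > 0.7:
            signals.append("long_form_injection")
    # Two or more signals block; one flags for review; none lets the input through
    if len(signals) >= 2:
        action = "block"
    else:
        action = "flag_for_review" if signals else "allow"
    return JailbreakResult(detected=bool(signals), signals=signals, action=action)
```
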
Indirect prompt injection through data:
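
One possible guard for instructions that arrive through retrieved documents or tool results rather than the user's message, as a rough sketch. The patterns, the `<document>` delimiters, and the function names are assumptions for illustration; pattern matching alone will miss many real attacks.

```python
import re

# Hypothetical, far-from-exhaustive patterns for instruction-like text in data.
INSTRUCTION_PATTERNS = [
    re.compile(r"ignore (previous|all|your) (instructions|rules)", re.I),
    re.compile(r"disregard (the|any) (above|previous)", re.I),
    re.compile(r"\byou are now\b", re.I),
]


def is_suspicious(document: str) -> bool:
    """True if the document contains text that reads like an instruction to the model."""
    return any(pattern.search(document) for pattern in INSTRUCTION_PATTERNS)


def build_context(documents: list[str]) -> tuple[str, list[str]]:
    """Quarantine suspicious documents and wrap the rest in explicit data delimiters.

    Returns the context string for the prompt and the list of quarantined documents.
    """
    safe_docs = [d for d in documents if not is_suspicious(d)]
    quarantined = [d for d in documents if is_suspicious(d)]
    wrapped = [
        f'<document index="{i}">\n{doc}\n</document>'
        for i, doc in enumerate(safe_docs)
    ]
    return "\n".join(wrapped), quarantined
```

The delimiters let the system prompt tell the model to treat everything inside them as data, not instructions.
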
Log all jailbreak attempts for pattern analysis — coordinated attacks show up as clusters.
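
A simple way to surface such clusters is a sliding-window count of distinct users per signal. The `AttemptLog` class, the 10-minute window, and the threshold of 5 are hypothetical choices for this sketch.

```python
from collections import defaultdict, deque


class AttemptLog:
    """Keeps recent jailbreak attempts per signal and flags bursts across users."""

    def __init__(self, window_seconds: float = 600.0, cluster_threshold: int = 5):
        self.window_seconds = window_seconds
        self.cluster_threshold = cluster_threshold
        self._events = defaultdict(deque)  # signal -> deque of (timestamp, user_id)

    def record(self, signal: str, user_id: str, timestamp: float) -> bool:
        """Record one attempt; return True if this signal now forms a cluster."""
        events = self._events[signal]
        events.append((timestamp, user_id))
        # Drop events that have fallen out of the sliding window.
        while events and timestamp - events[0][0] > self.window_seconds:
            events.popleft()
        distinct_users = {uid for _, uid in events}
        return len(distinct_users) >= self.cluster_threshold
```

Counting distinct users rather than raw events keeps one persistent user from looking like a coordinated attack.
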
AI outputs can embed demographic bias. Add systematic checks for high-stakes decisions:
Dimensions to monitor:
Fairness testing in eval:

```python
# Run same query with different demographic contexts and expect consistent quality
test_cases = [
    {"query": "Evaluate this resume", "candidate": {"name": "James Smith", "gender": "M"}},
    {"query": "Evaluate this resume", "candidate": {"name": "Jamal Smith", "gender": "M"}},
    {"query": "Evaluate this resume", "candidate": {"name": "Jane Smith", "gender": "F"}},
]
# Quality scores should not differ significantly across cases
# (max_quality_variance is a placeholder for your eval harness's comparison)
assert max_quality_variance(test_cases) < 0.05  # 5% tolerance
```

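
The source does not define `max_quality_variance`. One plausible reading of the "5% tolerance" is the relative spread of quality scores, sketched here. The extra `score_fn` parameter (whatever scores a model response for one test case) is an assumption added so the example is self-contained.

```python
from typing import Callable


def max_quality_variance(test_cases: list[dict], score_fn: Callable[[dict], float]) -> float:
    """Relative spread of quality scores across cases: (max - min) / |mean|."""
    if not test_cases:
        raise ValueError("need at least one test case")
    scores = [score_fn(case) for case in test_cases]
    mean = sum(scores) / len(scores)
    spread = max(scores) - min(scores)
    if mean == 0:
        # Avoid dividing by zero; any spread around a zero mean counts as unbounded.
        return 0.0 if spread == 0 else float("inf")
    return spread / abs(mean)
```

With identical scores the spread is 0; scores of 0.9, 0.6, and 0.9 give a spread of 0.3 over a mean of 0.8, well past a 0.05 threshold.
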
For high-stakes use cases (hiring, lending, medical, legal):

```
Before LLM: "John Smith (john.smith@example.com) ordered..."
Redacted:   "[NAME] ([EMAIL]) ordered..."
After LLM:  Re-inject PII only if needed in response
```

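
A minimal sketch of the redact-then-re-inject round trip. It assumes names come from a known list (standing in for an NER model) and numbers the placeholders so several values of the same kind can be restored; the function names and the example address are hypothetical.

```python
import re

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def redact(text: str, known_names: list[str]) -> tuple[str, dict[str, str]]:
    """Replace emails and known names with numbered placeholders.

    Returns the redacted text plus a placeholder -> original value mapping.
    """
    mapping: dict[str, str] = {}

    def substitute(kind: str, value: str) -> str:
        placeholder = f"[{kind}_{len(mapping) + 1}]"
        mapping[placeholder] = value
        return placeholder

    text = EMAIL_RE.sub(lambda m: substitute("EMAIL", m.group()), text)
    for name in known_names:
        if name in text:
            text = text.replace(name, substitute("NAME", name))
    return text, mapping


def reinject(text: str, mapping: dict[str, str]) -> str:
    """Restore original values for any placeholders present in the LLM response."""
    for placeholder, value in mapping.items():
        text = text.replace(placeholder, value)
    return text
```

Keep the mapping server-side and call `reinject` only when the response genuinely needs the original values.
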
EU AI Act (effective 2025-2026):

| Risk level | Examples | Requirements |
|------------|----------|--------------|
| Unacceptable | Social scoring, subliminal manipulation | Prohibited |
| High | Hiring, credit, medical, law enforcement | Impact assessment, human oversight, logging |
| Limited | Chatbots, deepfakes | Disclosure required |
| Minimal | Spam filters, games | No specific requirements |

For high-risk AI systems:
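
```python
import hashlib
from datetime import datetime, timezone

# Audit logging for high-risk decisions (EU AI Act record-keeping)
# audit_log and get_data_sources are placeholders for your audit sink and provenance lookup
def log_high_risk_decision(decision, user_id, model_version, confidence):
    audit_log.write({
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "decision": decision,
        # Pseudonymize with a stable digest; built-in hash() is salted per process for strings.
        # Consider a keyed HMAC if user IDs are guessable.
        "user_id": hashlib.sha256(str(user_id).encode("utf-8")).hexdigest(),
        "model_version": model_version,
        "confidence": confidence,
        "human_reviewed": False,
        "data_sources": get_data_sources(),
    })
```
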
Sensitive domain guardrails (medical/legal/financial):

```python
SENSITIVE_DOMAIN_RESPONSES = {
    "medical": "This is general information only. Consult a qualified healthcare provider for medical advice.",
    "legal": "This is not legal advice. Consult a licensed attorney for guidance on your specific situation.",
    "financial": "This is not financial advice. Consult a registered financial advisor before making investment decisions.",
}

def add_domain_disclaimer(output: str, detected_domain: str) -> str:
    if detected_domain in SENSITIVE_DOMAIN_RESPONSES:
        return output + f"\n\n⚠️ {SENSITIVE_DOMAIN_RESPONSES[detected_domain]}"
    return output
```

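
The `detected_domain` argument has to come from somewhere. A keyword-based detector is one cheap option, sketched below. The keyword lists and the `detect_domain` name are hypothetical, and a trained classifier would likely be more robust.

```python
import re
from typing import Optional

# Hypothetical keyword lists, matched as word prefixes (e.g. "invest" also matches "investing").
DOMAIN_KEYWORDS = {
    "medical": ["symptom", "diagnosis", "medication", "dosage"],
    "legal": ["lawsuit", "contract", "liability", "attorney"],
    "financial": ["invest", "stock", "loan", "retirement"],
}


def detect_domain(text: str) -> Optional[str]:
    """Return the first sensitive domain whose keywords appear in the text, else None."""
    lowered = text.lower()
    for domain, keywords in DOMAIN_KEYWORDS.items():
        if any(re.search(rf"\b{re.escape(keyword)}", lowered) for keyword in keywords):
            return domain
    return None
```

The result can be passed straight to a disclaimer step, with `None` meaning no disclaimer is appended.
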
Always include: