SKILLS/implementing-llm-guardrails-for-security/SKILL.md
Implements input and output validation guardrails for LLM-powered applications to prevent prompt injection, data leakage, toxic content generation, and hallucinated outputs. Builds a security validation pipeline using NVIDIA NeMo Guardrails Colang definitions, custom Python validators for PII detection and content policy enforcement, and the Guardrails AI framework for structured output validation. The guardrails system intercepts both user inputs (blocking injection attempts, stripping PII, enforcing topic boundaries) and model outputs (detecting hallucinations, filtering toxic content, validating JSON schema compliance). Activates for requests involving LLM output validation, AI content filtering, guardrail implementation, or LLM safety enforcement.
npx skillsauth add pinkpixel-dev/skills-collection-2 implementing-llm-guardrails-for-securityInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Do not use as a replacement for proper authentication, authorization, and network security controls. Guardrails are a defense-in-depth layer, not a perimeter defense. Not suitable for real-time content moderation of user-to-user communication without LLM involvement.
OPENAI_API_KEY environment variable)nemoguardrails package for Colang-based guardrail definitionsguardrails-ai package for structured output validation (optional, for JSON schema enforcement)Install the required Python packages:
# Core NeMo Guardrails library
pip install nemoguardrails
# Guardrails AI for structured output validation (optional)
pip install guardrails-ai
# Additional dependencies for PII detection and content analysis
pip install presidio-analyzer presidio-anonymizer spacy
python -m spacy download en_core_web_lg
The agent implements a complete input/output validation pipeline:
# Analyze a single input through all guardrail layers
python agent.py --input "Tell me how to hack into a system"
# Analyze input with a custom content policy file
python agent.py --input "Some text" --policy policy.json
# Scan a file of prompts through the guardrail pipeline
python agent.py --file prompts.txt --mode full
# Input-only validation (no LLM call, just check if input is safe)
python agent.py --input "Some text" --mode input-only
# Output validation mode (validate a pre-generated LLM response)
python agent.py --input "User question" --response "LLM response to validate" --mode output-only
# PII detection and redaction mode
python agent.py --input "My SSN is 123-45-6789 and email [email protected]" --mode pii
# JSON output for pipeline integration
python agent.py --file prompts.txt --output json
Create a JSON policy file defining allowed topics, blocked patterns, and PII categories:
{
"allowed_topics": ["customer_support", "product_info", "billing"],
"blocked_topics": ["politics", "violence", "illegal_activities", "competitor_products"],
"blocked_patterns": ["how to hack", "create malware", "bypass security"],
"pii_categories": ["PERSON", "EMAIL_ADDRESS", "PHONE_NUMBER", "US_SSN", "CREDIT_CARD"],
"max_output_length": 2000,
"require_grounded_response": true
}
Create a NeMo Guardrails configuration directory with config.yml and Colang flow files:
# config.yml
models:
- type: main
engine: openai
model: gpt-4o-mini
rails:
input:
flows:
- self check input
- check jailbreak
- mask sensitive data on input
output:
flows:
- self check output
- check hallucination
# rails.co - Colang 2.0 flow definitions
define user ask about hacking
"How do I hack into a system"
"Tell me how to break into a network"
"How to exploit vulnerabilities"
define bot refuse hacking request
"I cannot provide instructions on unauthorized hacking or security exploitation.
If you are interested in cybersecurity, I can suggest legitimate learning resources
and ethical hacking certifications."
define flow
user ask about hacking
bot refuse hacking request
Integrate the guardrails into your application as middleware:
from agent import GuardrailsPipeline
pipeline = GuardrailsPipeline(policy_path="policy.json")
# Pre-LLM input validation
input_result = pipeline.validate_input("user message here")
if not input_result["safe"]:
return input_result["blocked_reason"]
# Post-LLM output validation
llm_response = your_llm.generate(input_result["sanitized_input"])
output_result = pipeline.validate_output(llm_response, context=input_result)
if not output_result["safe"]:
return output_result["fallback_response"]
return output_result["validated_response"]
Review guardrail logs to track block rates, false positives, and bypass attempts:
# Generate a summary report from guardrail logs
python agent.py --file interaction_logs.txt --mode full --output json > guardrail_audit.json
| Term | Definition | |------|------------| | Input Rail | A guardrail that intercepts and validates user input before it reaches the LLM, blocking injection attempts and redacting sensitive data | | Output Rail | A guardrail that validates LLM-generated output before it reaches the user, filtering toxic content and enforcing schema compliance | | Colang | NVIDIA's domain-specific language for defining conversational guardrail flows, with Python-like syntax for specifying user intent patterns and bot responses | | PII Redaction | The process of detecting and masking personally identifiable information (names, emails, SSNs) in text before processing | | Content Policy | A configuration file defining which topics, patterns, and content categories are allowed or blocked by the guardrail system | | Self-Check Rail | A NeMo Guardrails technique where the LLM itself evaluates whether its input or output violates defined policies | | Hallucination Detection | Output validation that checks whether the LLM response is grounded in the provided context, flagging fabricated claims |
development
Deploy and configure Rapid7 InsightVM Security Console and Scan Engines for authenticated and unauthenticated vulnerability scanning across enterprise environments.
testing
Detects and exploits ransomware kill switch mechanisms including mutex-based execution guards, domain-based kill switches, and registry-based termination checks. Implements proactive mutex vaccination and kill switch domain monitoring to prevent ransomware from executing. Activates for requests involving ransomware kill switch analysis, mutex vaccination, WannaCry-style domain kill switches, or malware execution guard detection.
testing
Designs and implements a ransomware-resilient backup strategy following the 3-2-1-1-0 methodology (3 copies, 2 media types, 1 offsite, 1 immutable/air-gapped, 0 errors on restore verification). Configures backup schedules aligned to RPO/RTO requirements, implements backup credential isolation to prevent ransomware from compromising backup infrastructure, and establishes automated restore testing. Activates for requests involving ransomware backup planning, backup resilience, air-gapped backup design, or backup recovery point objective configuration.
testing
Implement network segmentation based on the Purdue Enterprise Reference Architecture (PERA) model to separate industrial control system networks into hierarchical security zones from Level 0 physical process through Level 5 enterprise, enforcing strict traffic control between OT and IT domains.