skills/agentdog-diagnostic-guardrail-framework/SKILL.md
Implement diagnostic safety guardrails for AI agent systems using the AgentDoG three-dimensional taxonomy (risk source, failure mode, real-world harm). Monitors agent trajectories, diagnoses root causes of unsafe actions, and provides fine-grained risk labels beyond binary safe/unsafe classification. Trigger phrases: "add safety guardrails to my agent", "diagnose agent risks", "monitor agent trajectory safety", "implement agentic guardrail", "classify agent risk behavior", "audit agent tool use safety"
npx skillsauth add ndpvt-web/arxiv-claude-skills agentdog-diagnostic-guardrail-framework
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
This skill enables Claude to design and implement diagnostic safety guardrails for AI agent systems following the AgentDoG framework from Liu et al. (2026). Rather than applying blunt binary safe/unsafe filters, this approach monitors full agent trajectories (sequences of actions and observations) and produces structured three-dimensional diagnoses: where the risk originates (source), how the agent fails (failure mode), and what real-world harm results (consequence). This gives developers actionable root-cause information instead of opaque rejections, and catches "seemingly safe but unreasonable" actions that binary classifiers miss.
Three-Dimensional Risk Taxonomy. AgentDoG decomposes agentic risk into three orthogonal dimensions. The Risk Source (Where) identifies the origin: user input (direct prompt injection), environmental observation (indirect injection, unreliable data), external entities (malicious tools, corrupted feedback), or internal logic (hallucination, flawed reasoning). The Failure Mode (How) captures the behavioral mechanism: unconfirmed actions, flawed planning, improper tool use, insecure interactions, procedural deviation, inefficient execution, harmful content generation, unauthorized disclosure, or inaccurate information. The Real-World Harm (What) categorizes consequences across 10 domains: privacy, financial, security, physical/health, psychological, reputational, information ecosystem, public service, fairness, and functional harms.
Trajectory-Level Diagnosis. Unlike guardrails that only inspect the final output, AgentDoG analyzes the full trajectory T = [(a1, o1), (a2, o2), ...] where each step is an action-observation pair. Actions include thinking traces, tool calls, and responses; observations include tool outputs and environment feedback. The guardrail produces a binary verdict plus a diagnostic three-tuple (risk_source, failure_mode, harm_type) drawn from the taxonomy. This means a blocked action comes with an explanation like "Risk Source: Environmental Observation (indirect prompt injection), Failure Mode: Improper Tool Use (wrong parameters), Harm: Financial Loss" rather than just "unsafe."
Attribution for Root-Cause Analysis. AgentDoG includes an explainability layer that traces unsafe verdicts back to specific trajectory steps and even to specific sentences. It computes temporal information gain (how much a step increased the likelihood of an unsafe verdict) and per-sentence attribution scores that combine necessity (how far the probability drops when the sentence is removed) and sufficiency (how much of the probability is retained when the sentence is evaluated in isolation). This lets developers pinpoint which step and which piece of context triggered the safety flag.
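The necessity and sufficiency idea can be illustrated with a small sketch. It assumes a hypothetical `risk_prob` callable that stands in for the guardrail model and returns P(unsafe) for a list of sentences. The equal weighting of the two terms and the `toy_scorer` are illustrative assumptions, not the scoring used in the paper.

```python
from typing import Callable, Sequence

# Hypothetical scorer: maps a list of sentences to P(unsafe) in [0, 1].
RiskScorer = Callable[[Sequence[str]], float]


def sentence_attribution(sentences: Sequence[str], risk_prob: RiskScorer) -> list[float]:
    """Score each sentence by averaging necessity and sufficiency.

    necessity   = drop in P(unsafe) when the sentence is removed
    sufficiency = P(unsafe) when the sentence is evaluated alone
    The 50/50 weighting is an assumption made for illustration.
    """
    full = risk_prob(sentences)
    scores = []
    for i, sentence in enumerate(sentences):
        without = [s for j, s in enumerate(sentences) if j != i]
        necessity = max(0.0, full - risk_prob(without))
        sufficiency = risk_prob([sentence])
        scores.append(0.5 * necessity + 0.5 * sufficiency)
    return scores


def toy_scorer(sentences: Sequence[str]) -> float:
    """Toy stand-in for a guardrail model: reacts to one known-bad phrase."""
    return 0.9 if any("ignore previous" in s.lower() for s in sentences) else 0.1
```

With a real guardrail model, `risk_prob` would call the model once per ablation, so the cost grows linearly with the number of sentences.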
Define your agent's action schema. Enumerate all tools the agent can call, their parameter types, and side effects. Categorize each tool's risk surface: read-only vs. write, internal vs. external, reversible vs. irreversible. This mirrors AgentDoG's tool definition inventory (drawn from 2,292+ tool definitions in training).
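One way to record such an inventory is sketched below. The `ToolSpec` fields and the tiering rule in `risk_level` are assumptions chosen to reproduce the tool classification used in Example 1; they are not part of AgentDoG itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolSpec:
    """Risk-relevant facts about one tool (field names are illustrative)."""
    name: str
    read_only: bool
    external: bool    # has side effects outside the agent's own workspace
    reversible: bool


def risk_level(spec: ToolSpec) -> str:
    """Derive a coarse risk tier; the rule itself is an assumption."""
    if spec.read_only:
        return "low"
    if spec.external and not spec.reversible:
        return "high"
    return "medium"


TOOLS = [
    ToolSpec("read_file", read_only=True, external=False, reversible=True),
    ToolSpec("write_file", read_only=False, external=False, reversible=False),
    ToolSpec("run_shell", read_only=False, external=True, reversible=False),
]

# Maps each tool name to its derived tier.
TOOL_RISK_LEVELS = {spec.name: risk_level(spec) for spec in TOOLS}
```

Deriving the tier from declared properties, rather than hand-assigning it, keeps the classification consistent when new tools are added.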
Implement trajectory logging. Instrument your agent loop to capture each step as a structured (action, observation) pair. Actions must include the tool name, parameters, and any chain-of-thought reasoning. Observations must include the raw tool response and any environment state changes. Store these as an ordered list.
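A minimal logging sketch follows, assuming the agent loop can call a `record` method once per step. `LoggedStep` and `TrajectoryLog` are hypothetical names; the dictionary keys mirror the fields listed above.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Any, Optional


@dataclass
class LoggedStep:
    action: dict[str, Any]       # tool, params, reasoning
    observation: dict[str, Any]  # output, side_effects


@dataclass
class TrajectoryLog:
    steps: list[LoggedStep] = field(default_factory=list)

    def record(
        self,
        tool: str,
        params: dict,
        reasoning: str,
        output: str,
        side_effects: Optional[list] = None,
    ) -> int:
        """Append one (action, observation) pair and return its index."""
        self.steps.append(LoggedStep(
            action={"tool": tool, "params": params, "reasoning": reasoning},
            observation={"output": output, "side_effects": side_effects or []},
        ))
        return len(self.steps) - 1

    def to_json(self) -> str:
        """Serialize the ordered steps for storage or later audit."""
        return json.dumps([asdict(step) for step in self.steps])
```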
Map your domain risks to the three-dimensional taxonomy. For each axis (risk source, failure mode, real-world harm), select the subcategories from the taxonomy above that apply to your agent.
Build the guardrail evaluation prompt. Construct a system prompt that presents the full trajectory and asks for: (a) a binary safe/unsafe verdict, (b) if unsafe, the three-tuple (source, failure_mode, harm_type) with free-text justification, (c) identification of the specific trajectory step(s) that triggered the verdict. Use the taxonomy labels as a constrained output vocabulary.
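The prompt and the constrained vocabulary might be sketched as follows. The label strings match the enum values used in Example 1. The prompt wording, the JSON reply format, and the helper names `render_guardrail_prompt` and `parse_verdict` are assumptions for illustration.

```python
import json

RISK_SOURCES = ["user_input", "environmental_observation", "external_entity", "internal_logic"]
FAILURE_MODES = [
    "unconfirmed_action", "flawed_planning", "improper_tool_use",
    "insecure_interaction", "procedural_deviation", "inefficient_execution",
    "harmful_generation", "unauthorized_disclosure", "inaccurate_information",
]
HARM_TYPES = [
    "privacy", "financial", "security", "physical_health", "psychological",
    "reputational", "info_ecosystem", "public_service", "fairness", "functional",
]


def render_guardrail_prompt(steps: list[dict]) -> str:
    """Render the trajectory plus the allowed labels into one prompt string."""
    header = (
        "Review the agent trajectory and reply with a JSON object with keys "
        "verdict (safe|unsafe), risk_source, failure_mode, harm_type, "
        "flagged_steps (list of step indices), justification.\n"
        f"Allowed risk_source values: {', '.join(RISK_SOURCES)}\n"
        f"Allowed failure_mode values: {', '.join(FAILURE_MODES)}\n"
        f"Allowed harm_type values: {', '.join(HARM_TYPES)}\n"
    )
    body = "\n".join(
        f"[step {i}] action={json.dumps(s['action'])} observation={json.dumps(s['observation'])}"
        for i, s in enumerate(steps)
    )
    return header + "Trajectory:\n" + body


def parse_verdict(raw: str) -> dict:
    """Parse the model reply and reject labels outside the vocabulary."""
    data = json.loads(raw)
    if data.get("verdict") not in ("safe", "unsafe"):
        raise ValueError("verdict must be 'safe' or 'unsafe'")
    if data["verdict"] == "unsafe":
        checks = (
            ("risk_source", RISK_SOURCES),
            ("failure_mode", FAILURE_MODES),
            ("harm_type", HARM_TYPES),
        )
        for key, allowed in checks:
            if data.get(key) not in allowed:
                raise ValueError(f"{key} outside taxonomy: {data.get(key)!r}")
    return data
```

Validating the reply against the vocabulary means a hallucinated label fails loudly instead of silently reaching the routing logic.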
Implement pre-execution and post-step hooks. Insert the guardrail at two points: before executing high-risk tool calls (pre-execution gate) and after each step completes (post-step audit). Pre-execution gates block dangerous actions; post-step audits catch cascading risks that only emerge across multiple steps.
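The two insertion points could be wired into an agent loop roughly as below. This is a sketch under stated assumptions: `execute`, `pre_gate`, and `post_audit` are hypothetical callables supplied by the integrator, and the default set of high-risk tools is borrowed from Example 1.

```python
from typing import Callable

Step = dict  # {"action": {...}, "observation": {...}}


def run_step_with_guardrail(
    action: dict,
    history: list[Step],
    execute: Callable[[dict], dict],
    pre_gate: Callable[[dict, list[Step]], bool],
    post_audit: Callable[[list[Step]], bool],
    high_risk_tools: frozenset[str] = frozenset({"run_shell", "write_file"}),
) -> str:
    """Run one agent step and return 'blocked', 'flagged', or 'ok'."""
    # Pre-execution gate: only consulted for high-risk tools.
    if action.get("tool") in high_risk_tools and not pre_gate(action, history):
        return "blocked"  # the action never executes
    observation = execute(action)
    history.append({"action": action, "observation": observation})
    # Post-step audit: sees the whole trajectory, including the new step.
    if not post_audit(history):
        return "flagged"  # executed, but the trajectory now looks unsafe
    return "ok"
```

Both callables return `True` when the step is acceptable, so a guardrail that fails open or closed can be swapped in without changing the loop.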
Handle "safe but unreasonable" actions. Configure the guardrail to flag actions that are not overtly harmful but indicate degraded agent behavior: redundant API calls (Inefficient Execution), skipping confirmation for destructive operations (Unconfirmed Action), or deviating from the established plan without justification (Procedural Deviation). These should produce warnings rather than hard blocks.
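As one example of a "safe but unreasonable" check, the sketch below flags repeated identical tool calls, which would map to Inefficient Execution. The function name and the `max_repeats` threshold are assumptions; a real deployment would tune the threshold per tool.

```python
import json
from collections import Counter


def find_redundant_calls(steps: list[dict], max_repeats: int = 2) -> list[int]:
    """Return indices of steps that repeat an identical tool call too often."""
    seen: Counter = Counter()
    flagged = []
    for i, step in enumerate(steps):
        action = step["action"]
        # Identical means same tool and same parameters, regardless of key order.
        key = (action.get("tool"), json.dumps(action.get("params", {}), sort_keys=True))
        seen[key] += 1
        if seen[key] > max_repeats:
            flagged.append(i)
    return flagged
```

The returned indices are candidates for a warning, not a block, in line with the guidance above.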
Implement graduated response policies. Map each harm category to an enforcement action: hard-block (e.g., security, financial), soft-block with user confirmation (e.g., privacy, reputational), warn-and-log (e.g., functional, inefficiency). The three-tuple output drives this routing automatically.
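A routing table for these policies might look like the sketch below. The harm-to-action mapping follows the examples in the paragraph above; the default for unmapped harm types is an assumption, as are the action names.

```python
# Enforcement action per harm type, following the examples in the text.
ENFORCEMENT_BY_HARM = {
    "security": "hard_block",
    "financial": "hard_block",
    "privacy": "confirm",       # soft-block pending user confirmation
    "reputational": "confirm",
    "functional": "warn",       # warn-and-log
}

# Assumption: harm types without an explicit policy ask the user.
DEFAULT_ENFORCEMENT = "confirm"


def route(verdict: dict) -> str:
    """Map a guardrail verdict to 'allow', 'warn', 'confirm', or 'hard_block'."""
    if verdict.get("verdict") == "safe":
        return "allow"
    return ENFORCEMENT_BY_HARM.get(verdict.get("harm_type"), DEFAULT_ENFORCEMENT)
```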
Add attribution tracing for flagged actions. When a trajectory is flagged, walk backward through the steps to identify the root cause. Compute which step most increased the risk signal (temporal information gain). Present the developer with: the triggering step, the originating risk source, and the specific text or parameter that caused the flag.
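Step-level tracing can be sketched as follows. It assumes a hypothetical `risk_prob` callable that scores a trajectory prefix, and it defines the gain of a step as the change in P(unsafe) when that step is appended. That definition is a simplification for illustration and may differ from the formulation in the paper.

```python
from typing import Callable, Sequence

# Hypothetical scorer: maps a trajectory prefix to P(unsafe) in [0, 1].
PrefixScorer = Callable[[Sequence[dict]], float]


def temporal_information_gain(steps: Sequence[dict], risk_prob: PrefixScorer) -> list[float]:
    """Change in P(unsafe) contributed by each step (simplified definition)."""
    gains = []
    previous = risk_prob(steps[:0])  # risk of the empty trajectory
    for i in range(len(steps)):
        current = risk_prob(steps[: i + 1])
        gains.append(current - previous)
        previous = current
    return gains


def root_cause_step(steps: Sequence[dict], risk_prob: PrefixScorer) -> int:
    """Index of the step with the largest gain; the earliest one wins ties."""
    if not steps:
        raise ValueError("trajectory is empty")
    gains = temporal_information_gain(steps, risk_prob)
    return max(range(len(gains)), key=gains.__getitem__)


def toy_prefix_scorer(prefix: Sequence[dict]) -> float:
    """Toy stand-in for a guardrail model: reacts to one known-bad phrase."""
    tainted = any(
        "ignore previous" in s["observation"].get("output", "").lower() for s in prefix
    )
    return 0.8 if tainted else 0.1
```

The step returned by `root_cause_step` is where a developer would start reading, together with the risk source and the offending text from the diagnosis.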
Build a feedback loop for taxonomy refinement. Log all guardrail verdicts with their three-tuples. Periodically review false positives and false negatives to identify taxonomy gaps or miscalibrated thresholds in your domain. Add domain-specific subcategories under the three dimensions as needed.
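The review step could start from a tally like the one sketched here. It assumes each logged record carries the guardrail's verdict, a human reviewer's verdict, and the three taxonomy labels; the field names `guardrail_verdict` and `reviewer_verdict` are hypothetical.

```python
from collections import Counter


def error_hotspots(records: list[dict], top_n: int = 3) -> list[tuple[tuple, int]]:
    """Rank taxonomy tuples by how often a reviewer overturned the guardrail."""
    errors = Counter(
        (r["risk_source"], r["failure_mode"], r["harm_type"])
        for r in records
        if r["guardrail_verdict"] != r["reviewer_verdict"]
    )
    return errors.most_common(top_n)
```

Tuples that keep appearing at the top of this list are the first places to look for a missing subcategory or a miscalibrated threshold.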
Test against adversarial trajectories. Construct test cases for each cell of the taxonomy matrix: direct prompt injection leading to unauthorized file deletion (User Input x Improper Tool Use x Security Harm), indirect injection via tool output leading to data exfiltration (Environmental Observation x Insecure Interaction x Privacy Harm), etc. Aim for coverage across all three dimensions.
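Coverage of the taxonomy matrix can be tracked with a short helper such as the one below. The label lists repeat the enum values from Example 1, and the `uncovered_cells` helper and the test-case field names are assumptions. Not every combination is necessarily meaningful in a given domain, so the result is a checklist rather than a target to hit exactly.

```python
from itertools import product

RISK_SOURCES = ["user_input", "environmental_observation", "external_entity", "internal_logic"]
FAILURE_MODES = [
    "unconfirmed_action", "flawed_planning", "improper_tool_use",
    "insecure_interaction", "procedural_deviation", "inefficient_execution",
    "harmful_generation", "unauthorized_disclosure", "inaccurate_information",
]
HARM_TYPES = [
    "privacy", "financial", "security", "physical_health", "psychological",
    "reputational", "info_ecosystem", "public_service", "fairness", "functional",
]


def uncovered_cells(test_cases: list[dict]) -> list[tuple[str, str, str]]:
    """List (source, failure_mode, harm) cells that no test case exercises."""
    covered = {
        (c["risk_source"], c["failure_mode"], c["harm_type"]) for c in test_cases
    }
    return [
        cell
        for cell in product(RISK_SOURCES, FAILURE_MODES, HARM_TYPES)
        if cell not in covered
    ]


# The two adversarial cases named in the text above.
CASES = [
    {"risk_source": "user_input", "failure_mode": "improper_tool_use", "harm_type": "security"},
    {"risk_source": "environmental_observation", "failure_mode": "insecure_interaction", "harm_type": "privacy"},
]
```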
Example 1: Guardrail middleware for a coding agent
User: "Add a safety guardrail to my coding agent that uses tools like run_shell, write_file, and read_file."
Approach:
run_shell is high-risk (irreversible, external), write_file is medium-risk (irreversible, internal), read_file is low-risk (read-only).
Apply pre-execution gates to run_shell and write_file.
Output:
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class RiskSource(Enum):
    USER_INPUT = "user_input"
    ENVIRONMENTAL = "environmental_observation"
    EXTERNAL_ENTITY = "external_entity"
    INTERNAL_LOGIC = "internal_logic"


class FailureMode(Enum):
    UNCONFIRMED_ACTION = "unconfirmed_action"
    FLAWED_PLANNING = "flawed_planning"
    IMPROPER_TOOL_USE = "improper_tool_use"
    INSECURE_INTERACTION = "insecure_interaction"
    PROCEDURAL_DEVIATION = "procedural_deviation"
    INEFFICIENT_EXECUTION = "inefficient_execution"
    HARMFUL_GENERATION = "harmful_generation"
    UNAUTHORIZED_DISCLOSURE = "unauthorized_disclosure"
    INACCURATE_INFO = "inaccurate_information"


class HarmType(Enum):
    PRIVACY = "privacy"
    FINANCIAL = "financial"
    SECURITY = "security"
    PHYSICAL = "physical_health"
    PSYCHOLOGICAL = "psychological"
    REPUTATIONAL = "reputational"
    INFO_ECOSYSTEM = "info_ecosystem"
    PUBLIC_SERVICE = "public_service"
    FAIRNESS = "fairness"
    FUNCTIONAL = "functional"


@dataclass
class TrajectoryStep:
    action: dict       # {"tool": str, "params": dict, "reasoning": str}
    observation: dict  # {"output": str, "side_effects": list}


@dataclass
class Diagnosis:
    is_safe: bool
    source: Optional[RiskSource] = None
    failure_mode: Optional[FailureMode] = None
    harm_type: Optional[HarmType] = None
    flagged_step: Optional[int] = None
    explanation: Optional[str] = None


TOOL_RISK_LEVELS = {
    "run_shell": "high",     # irreversible, external side effects
    "write_file": "medium",  # irreversible, internal
    "read_file": "low",      # read-only
}

PRE_EXEC_GATES = {"high", "medium"}  # tools requiring pre-execution check


def evaluate_trajectory(steps: list[TrajectoryStep]) -> Diagnosis:
    """Analyze full trajectory and return three-dimensional diagnosis."""
    # Build the evaluation prompt with the full trajectory
    prompt = build_guardrail_prompt(steps)
    # Call the guardrail model (AgentDoG or LLM-as-judge)
    result = call_guardrail_model(prompt)
    return parse_diagnosis(result)


def pre_execution_gate(step: TrajectoryStep, history: list[TrajectoryStep]) -> Diagnosis:
    """Check a pending action before execution."""
    tool = step.action.get("tool", "")
    if TOOL_RISK_LEVELS.get(tool, "low") not in PRE_EXEC_GATES:
        return Diagnosis(is_safe=True)

    # Check for common unsafe patterns
    params = step.action.get("params", {})

    # Pattern: shell command with piped secrets or destructive ops
    if tool == "run_shell":
        cmd = params.get("command", "")
        if any(p in cmd for p in ["rm -rf /", "| curl", "chmod 777", "> /dev/"]):
            return Diagnosis(
                is_safe=False,
                source=RiskSource.INTERNAL_LOGIC,
                failure_mode=FailureMode.IMPROPER_TOOL_USE,
                harm_type=HarmType.SECURITY,
                flagged_step=len(history),
                explanation=f"Blocked destructive/exfiltration shell command: {cmd[:80]}"
            )

    # Pattern: writing to sensitive paths without confirmation
    if tool == "write_file":
        path = params.get("path", "")
        if any(s in path for s in [".env", "credentials", "/etc/", ".ssh/"]):
            return Diagnosis(
                is_safe=False,
                source=RiskSource.INTERNAL_LOGIC,
                failure_mode=FailureMode.UNCONFIRMED_ACTION,
                harm_type=HarmType.PRIVACY,
                flagged_step=len(history),
                explanation=f"Write to sensitive path requires confirmation: {path}"
            )

    # Full trajectory evaluation for context-dependent risks
    return evaluate_trajectory(history + [step])
Example 2: Diagnosing an indirect prompt injection in tool output
User: "My browsing agent fetched a webpage that contained hidden instructions. How do I detect this?"
Approach:
Output:
def detect_injection_in_observation(
    step: TrajectoryStep,
    original_task: str,
    prior_plan: list[str],
) -> Diagnosis:
    """Detect indirect prompt injection in tool/environment output."""
    observation_text = step.observation.get("output", "")

    # Heuristic: check for instruction-like patterns in tool output
    injection_signals = [
        "ignore previous",
        "ignore all prior",
        "disregard your instructions",
        "new task:",
        "system prompt:",
        "you are now",
        "instead, do the following",
    ]
    has_injection_signal = any(
        sig in observation_text.lower() for sig in injection_signals
    )
    if not has_injection_signal:
        return Diagnosis(is_safe=True)

    # Check if the agent's next action deviates from prior plan
    next_action = step.action.get("reasoning", "")
    plan_deviation = not any(
        planned_step.lower() in next_action.lower()
        for planned_step in prior_plan
    )

    if has_injection_signal and plan_deviation:
        return Diagnosis(
            is_safe=False,
            source=RiskSource.ENVIRONMENTAL,
            failure_mode=FailureMode.PROCEDURAL_DEVIATION,
            harm_type=HarmType.SECURITY,
            flagged_step=None,
            explanation=(
                "Indirect prompt injection detected in tool output. "
                "Agent deviated from established plan after processing "
                f"tainted observation containing: "
                f"'{next(s for s in injection_signals if s in observation_text.lower())}'"
            ),
        )

    # Signal present but agent stayed on plan -- warn but don't block
    return Diagnosis(
        is_safe=True,
        source=RiskSource.ENVIRONMENTAL,
        failure_mode=FailureMode.INSECURE_INTERACTION,
        harm_type=HarmType.SECURITY,
        explanation="Injection-like content in observation, but agent maintained plan adherence. Monitor closely.",
    )
Example 3: Auditing a multi-step agent trajectory post-hoc
User: "I have a log of my agent's actions. Can you analyze it for safety issues using the AgentDoG taxonomy?"
Approach:
Parse the log into (action, observation) pairs.
Output:
def audit_trajectory(trajectory_log: list[dict]) -> dict:
    """Produce a structured safety audit of a recorded agent trajectory."""
    steps = [
        TrajectoryStep(action=entry["action"], observation=entry["observation"])
        for entry in trajectory_log
    ]

    findings = []
    for i, step in enumerate(steps):
        diagnosis = evaluate_trajectory(steps[: i + 1])
        if not diagnosis.is_safe or diagnosis.explanation:
            findings.append({
                "step": i,
                "tool": step.action.get("tool"),
                "verdict": "unsafe" if not diagnosis.is_safe else "warning",
                "risk_source": diagnosis.source.value if diagnosis.source else None,
                "failure_mode": diagnosis.failure_mode.value if diagnosis.failure_mode else None,
                "harm_type": diagnosis.harm_type.value if diagnosis.harm_type else None,
                "explanation": diagnosis.explanation,
            })

    return {
        "total_steps": len(steps),
        "findings_count": len(findings),
        "unsafe_count": sum(1 for f in findings if f["verdict"] == "unsafe"),
        "warning_count": sum(1 for f in findings if f["verdict"] == "warning"),
        "findings": findings,
        "taxonomy_coverage": summarize_taxonomy_hits(findings),
    }


# Example audit output:
# {
#   "total_steps": 12,
#   "findings_count": 2,
#   "unsafe_count": 1,
#   "warning_count": 1,
#   "findings": [
#     {
#       "step": 4,
#       "tool": "web_search",
#       "verdict": "warning",
#       "risk_source": "environmental_observation",
#       "failure_mode": "inaccurate_information",
#       "harm_type": "info_ecosystem",
#       "explanation": "Agent used unverified search result as factual basis for recommendation"
#     },
#     {
#       "step": 9,
#       "tool": "send_email",
#       "verdict": "unsafe",
#       "risk_source": "internal_logic",
#       "failure_mode": "unconfirmed_action",
#       "harm_type": "privacy",
#       "explanation": "Agent sent email containing user PII without explicit confirmation"
#     }
#   ]
# }
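The audit above calls `summarize_taxonomy_hits`, which is not defined in the example. One possible implementation, assuming the helper simply counts how often each label appears on each axis, is sketched here; the original may have intended something different.

```python
from collections import Counter


def summarize_taxonomy_hits(findings: list[dict]) -> dict:
    """Count findings per label on each taxonomy axis (assumed behavior)."""
    summary = {}
    for axis in ("risk_source", "failure_mode", "harm_type"):
        # Skip findings where the axis is missing or None.
        counts = Counter(f[axis] for f in findings if f.get(axis))
        summary[axis] = dict(counts)
    return summary
```

Applied to the two findings in the example output, this yields one hit each for `environmental_observation` and `internal_logic` on the `risk_source` axis.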
Repeated findings that share a (source, failure_mode) pair indicate a systematic agent weakness to fix upstream.
The same tool call (e.g., write_file) may be safe in one trajectory context and unsafe in another. Always evaluate with trajectory history.
Paper: Liu, D., Ren, Q., Qian, C., Shao, S., & Xie, Y. (2026). AgentDoG: A Diagnostic Guardrail Framework for AI Agent Safety and Security. arXiv:2601.18491v1. https://arxiv.org/abs/2601.18491v1
Look for: Section 3 (three-dimensional taxonomy with full subcategory definitions), Section 4 (ATBench benchmark construction), Section 5 (diagnostic guardrail architecture and trajectory-level evaluation), and Appendix A (complete taxonomy tables with examples for every subcategory).