skills/addressing-explainability-generative-ai/SKILL.md
Explain generative AI outputs using the gSMILE perturbation-based attribution framework. Builds local surrogate models from controlled input perturbations and Wasserstein distance to produce token-level or word-level importance scores for LLM and diffusion model outputs. Triggers: 'explain why the model generated this', 'token attribution for prompt', 'which words in my prompt matter most', 'interpret generative model output', 'build explainability for my LLM pipeline', 'debug prompt influence on generation'
Install this skill globally with one command (works with Claude Code, Cursor, and Windsurf): `npx skillsauth add ndpvt-web/arxiv-claude-skills addressing-explainability-generative-ai`
This skill enables Claude to implement the gSMILE (generative Statistical Model-agnostic Interpretability with Local Explanations) framework for explaining why generative models — LLMs and text-to-image diffusion models — produce specific outputs. The core technique treats any generative model as a black box, systematically perturbs input tokens, measures output distribution shifts via Wasserstein distance, and fits a weighted local linear surrogate whose coefficients directly yield per-token importance scores. The result is actionable attribution heatmaps that answer "which parts of my prompt drove this output?"
gSMILE extends the SMILE interpretability method from classification to generative settings. Traditional explainability methods like LIME perturb inputs and measure changes in class probabilities — a scalar target. Generative models produce high-dimensional outputs (token sequences, images), so gSMILE replaces scalar probability shifts with Wasserstein distance between output distributions. Given an original prompt x and a perturbed variant x̂ⱼ, the output-level semantic shift Δ(x, x̂ⱼ) = W(π(y|x), π(y|x̂ⱼ)) captures how much the full output distribution moved. This is the key innovation: measuring distributional shift rather than point predictions.
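To make the distance concrete, here is a minimal sketch of the Wasserstein-1 distance in the simplest case: two equal-sized one-dimensional samples. It assumes each output distribution has already been summarized as a list of scalar scores (for example, a 1-D projection of sampled output embeddings); that summarization step and the function name `wasserstein_1d` are illustrative assumptions, not part of the framework description above.

```python
def wasserstein_1d(samples_a, samples_b):
    """Wasserstein-1 distance between two equal-sized 1-D empirical samples.

    With equal sizes and uniform weights, the optimal transport plan pairs
    the i-th smallest value of one sample with the i-th smallest of the other.
    """
    if len(samples_a) != len(samples_b) or len(samples_a) == 0:
        raise ValueError("samples must be non-empty and of equal size")
    a, b = sorted(samples_a), sorted(samples_b)
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


# Hypothetical scalar summaries of outputs sampled for the original prompt
# and for one perturbed prompt.
original_scores = [0.0, 1.0, 3.0]
perturbed_scores = [5.0, 6.0, 8.0]
print(wasserstein_1d(original_scores, perturbed_scores))  # 5.0
```

For weighted or unequal-sized samples, `scipy.stats.wasserstein_distance` covers the general 1-D case; higher-dimensional outputs need an optimal-transport solver or an embedding-distance proxy.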
The perturbations are weighted by proximity to the original input using a Gaussian kernel: wⱼ = exp(-δⱼ² / σ²), where δⱼ is the input-space distance between original and perturbed prompts. This ensures that the surrogate model prioritizes behavior near the original operating point. A weighted linear regression is then fit: hθ(zⱼ) ≈ Δ(x, x̂ⱼ), where zⱼ is a binary feature vector indicating token presence/absence. The resulting coefficients θ are the attribution scores — positive means the token pushes the output in its current direction, negative means it suppresses it, and magnitude indicates strength.
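The kernel weighting and weighted regression described above can be sketched on synthetic data. This is an illustration under stated assumptions, not the reference implementation: the input-space distance is taken to be the raw Hamming distance (number of dropped tokens), the data are noise-free and made up, and `fit_local_surrogate` is a hypothetical helper name.

```python
import numpy as np


def fit_local_surrogate(Z, deltas, sigma):
    """Weighted least-squares fit of deltas ~ theta . z + theta_0.

    Z:      (J, N) binary matrix, 1 = token kept, 0 = token dropped
    deltas: (J,) output-level distances, one per perturbation
    sigma:  kernel width (a tuning choice)
    """
    Z = np.asarray(Z, dtype=float)
    deltas = np.asarray(deltas, dtype=float)
    # Assumed input-space distance: Hamming distance to the all-ones vector
    # that represents the unperturbed prompt.
    input_dist = (1.0 - Z).sum(axis=1)
    weights = np.exp(-(input_dist ** 2) / sigma ** 2)
    # Scaling each row by sqrt(w) turns the weighted problem into an
    # ordinary least-squares problem.
    X = np.hstack([Z, np.ones((Z.shape[0], 1))])
    sw = np.sqrt(weights)
    solution, *_ = np.linalg.lstsq(X * sw[:, None], deltas * sw, rcond=None)
    return solution[:-1], solution[-1], weights


rng = np.random.default_rng(0)
Z = (rng.random((200, 4)) > 0.2).astype(float)   # about 20% dropout
true_theta = np.array([-0.9, -0.1, 0.0, -0.5])   # synthetic ground truth
deltas = Z @ true_theta + 1.5                    # noise-free synthetic distances
theta, intercept, weights = fit_local_surrogate(Z, deltas, sigma=2.0)
print(np.round(theta, 3), round(float(intercept), 3))
```

Because the synthetic distances are exactly linear in `Z`, the fit recovers `true_theta` up to floating-point error. Real model outputs are noisy, so the recovered coefficients should be read together with a fidelity score.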
This approach is model-agnostic (requires only API access), local (explains one prompt at a time), and statistically grounded (Lipschitz smoothness assumptions justify the linear approximation in a local neighborhood). It works for any generative model that accepts text input and produces a scorable output.
1. Define the explanation target. Identify the specific prompt x and the generative model to explain. Record the original output y₀ = model(x) as the baseline. For LLMs, store the full token probability distribution or the output text; for image models, store the generated image embedding.
2. Tokenize and build the feature space. Split the prompt into N interpretable units (tokens, words, or phrases). Create a binary feature vector template z ∈ {0,1}^N where zᵢ = 1 means token i is present.
3. Generate J perturbations. For each perturbation j = 1..J (typically J = 100–500), create a masked variant x̂ⱼ by randomly dropping or replacing a subset of tokens. Record each perturbation's binary feature vector zⱼ. Use a dropout rate of 10–30% per perturbation to balance signal and locality.
4. Collect perturbed outputs. Pass each x̂ⱼ through the generative model to obtain output ŷⱼ. For LLMs, capture the output token probabilities or full generated text. For image models, capture the generated image or its CLIP embedding.
5. Compute output-level distances. For each perturbation, calculate Δⱼ = W(y₀, ŷⱼ) using Wasserstein distance. For text: use token probability distribution divergence or embedding cosine distance. For images: use CLIP embedding L2 distance or LPIPS perceptual distance.
6. Compute input-level distances and Gaussian weights. Calculate δⱼ as the Hamming distance (or cosine distance) between the original feature vector and zⱼ. Apply the Gaussian kernel: wⱼ = exp(-δⱼ² / σ²). Tune σ so that roughly 60–80% of perturbations receive meaningful weight.
7. Fit the weighted linear surrogate. Solve the weighted least squares problem: θ* = argmin_θ Σⱼ wⱼ · (hθ(zⱼ) - Δⱼ)², where hθ(z) = θ · z + θ₀. Use sklearn.linear_model.LinearRegression with sample weights, or numpy.linalg.lstsq on the weight-scaled system.
8. Extract and normalize attribution scores. The coefficient vector θ* contains per-token importance. Normalize to [0, 1] range for heatmap visualization. Higher absolute values indicate stronger influence. Sign indicates direction: positive means the token's presence increases output divergence from a null baseline.
9. Visualize as attribution heatmaps. Map normalized scores back to the original prompt tokens. Render as a color-coded heatmap (red = high importance, blue = low) using matplotlib, HTML spans with background colors, or a terminal-based display.
10. Validate with fidelity and stability metrics. Compute weighted MSE between surrogate predictions and actual distances (fidelity). Compute Jaccard similarity of top-K attributed tokens across repeated runs (stability). Flag explanations with fidelity R² < 0.7 as unreliable.
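The fidelity and stability checks from the validation step can be sketched in a few lines of plain Python. The helper names and the sample scores are hypothetical; the sketch compares top-K sets by token position rather than by token string, which is an assumption made here so that repeated words stay distinct.

```python
def weighted_mse(predicted, actual, weights):
    """Fidelity: kernel-weighted mean squared error of the surrogate."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(w * (p - a) ** 2 for p, a, w in zip(predicted, actual, weights)) / total


def top_k_indices(scores, k):
    """Positions of the k largest scores (positions, not token strings,
    so repeated words such as "is" are not merged)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(order[:k])


def jaccard_top_k(scores_a, scores_b, k):
    """Stability: overlap of the top-k token positions from two runs."""
    if k < 1 or not scores_a or not scores_b:
        raise ValueError("need k >= 1 and non-empty score lists")
    a, b = top_k_indices(scores_a, k), top_k_indices(scores_b, k)
    return len(a & b) / len(a | b)


# Made-up attribution scores from two independent runs on the same prompt.
run_a = [0.21, 0.03, 0.78, 0.95, 0.61]
run_b = [0.25, 0.02, 0.70, 0.90, 0.10]
print(jaccard_top_k(run_a, run_b, k=3))                   # 0.5
print(weighted_mse([1.0, 2.0], [1.0, 4.0], [1.0, 1.0]))   # 2.0
```

A Jaccard value near 1 across several runs suggests the ranking is stable; low values point toward increasing J or averaging runs.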
Example 1: Explaining an LLM response to a factual question
User: "I prompted GPT with 'What is the capital of France and why is it historically significant?' and got a long answer. Which parts of my prompt drove the response?"
Approach:
["What", "is", "the", "capital", "of", "France", "and", "why", "is", "it", "historically", "significant", "?"]
Output:
Token Attribution Scores (normalized 0–1):
What ██░░░░░░░░ 0.21
is ░░░░░░░░░░ 0.03
the ░░░░░░░░░░ 0.02
capital ████████░░ 0.78
of ░░░░░░░░░░ 0.04
France ██████████ 0.95
and █░░░░░░░░░ 0.08
why ██████░░░░ 0.61
is ░░░░░░░░░░ 0.02
it ░░░░░░░░░░ 0.03
historically ████████░░ 0.82
significant ██████░░░░ 0.58
? ░░░░░░░░░░ 0.01
Surrogate fidelity R²: 0.87
Top drivers: "France" (0.95), "historically" (0.82), "capital" (0.78)
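A terminal display in the style of the output above could be produced with a small helper like the following. This is a sketch: `render_bars` is a hypothetical name, and the rounding rule (nearest tenth of the bar width) is an assumption, so bar lengths may differ by one cell from the listing above for some scores.

```python
def render_bars(tokens, scores, width=10):
    """Render one 'token bar score' line per token for terminal display."""
    pad = max((len(t) for t in tokens), default=0)
    lines = []
    for token, score in zip(tokens, scores):
        clipped = min(max(score, 0.0), 1.0)        # keep the bar within bounds
        filled = int(clipped * width + 0.5)        # round to the nearest cell
        bar = "█" * filled + "░" * (width - filled)
        lines.append(f"{token.ljust(pad)} {bar} {score:.2f}")
    return "\n".join(lines)


print(render_bars(["capital", "of", "France"], [0.78, 0.04, 1.0]))
```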
Example 2: Debugging a text-to-image prompt
User: "My Stable Diffusion prompt 'A serene Japanese garden at sunset with cherry blossoms and a stone bridge' keeps generating images without the bridge. Help me understand which words matter."
Approach:
Output:
Token Attribution Scores:
A ░░░░░░░░░░ 0.01
serene ███░░░░░░░ 0.29
Japanese ████████░░ 0.81
garden █████████░ 0.88
at ░░░░░░░░░░ 0.02
sunset ██████░░░░ 0.63
with ░░░░░░░░░░ 0.03
cherry ██████░░░░ 0.59
blossoms █████░░░░░ 0.52
and ░░░░░░░░░░ 0.01
a ░░░░░░░░░░ 0.01
stone █░░░░░░░░░ 0.11
bridge ██░░░░░░░░ 0.14
Insight: "stone" (0.11) and "bridge" (0.14) have very low attribution,
meaning the model largely ignores them. "Japanese garden" dominates.
Recommendation: Move "stone bridge" to the beginning of the prompt
or increase its weight with prompt syntax like "(stone bridge:1.5)".
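To check whether "stone bridge" matters as a unit rather than as two separate words, one option is to perturb at the phrase level. The sketch below drops whole phrases; the phrase segmentation is chosen by hand for illustration, and `perturb_units` is a hypothetical helper, not part of any library.

```python
import random


def perturb_units(units, dropout_rate, rng):
    """Drop whole units at random; return (binary mask, perturbed prompt)."""
    mask = [0 if rng.random() < dropout_rate else 1 for _ in units]
    if not any(mask):
        # Keep at least one unit so the model never receives an empty prompt.
        mask[rng.randrange(len(units))] = 1
    prompt = " ".join(u for u, m in zip(units, mask) if m)
    return mask, prompt


# Hand-chosen phrase segmentation of the prompt (an assumption).
units = ["A serene", "Japanese garden", "at sunset",
         "with cherry blossoms", "and a stone bridge"]
rng = random.Random(0)  # seeded for reproducibility
for _ in range(3):
    mask, prompt = perturb_units(units, dropout_rate=0.2, rng=rng)
    print(mask, "->", prompt)
```

Each mask then serves as one row of the binary feature matrix, with one coefficient per phrase instead of per word.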
Example 3: Implementing gSMILE as a Python module
User: "Build me a reusable explainability module for my LLM API wrapper."
```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sentence_transformers import SentenceTransformer


class GSMILEExplainer:
    def __init__(self, model_fn, n_perturbations=200, dropout_rate=0.2, sigma=None):
        """
        model_fn: callable that takes a string prompt and returns output text
        """
        self.model_fn = model_fn
        self.n_perturbations = n_perturbations
        self.dropout_rate = dropout_rate
        self.sigma = sigma
        self.embedder = SentenceTransformer('all-MiniLM-L6-v2')

    def explain(self, prompt: str) -> dict:
        tokens = prompt.split()  # whitespace split: one feature per word
        n = len(tokens)
        # Note: input distances below are normalized to [0, 1], so this default
        # gives nearly uniform weights; pass a smaller sigma to sharpen locality.
        sigma = self.sigma or (0.5 * np.sqrt(n))

        # Step 1: Get baseline output
        baseline_output = self.model_fn(prompt)
        baseline_emb = self.embedder.encode([baseline_output])[0]

        # Step 2: Generate perturbations
        Z = np.ones((self.n_perturbations, n))    # feature vectors
        deltas = np.zeros(self.n_perturbations)   # output distances
        input_dists = np.zeros(self.n_perturbations)

        for j in range(self.n_perturbations):
            mask = np.random.random(n) > self.dropout_rate
            if not mask.any():
                # Keep the first token so the prompt is never empty and the
                # feature vector matches the prompt that was actually sent.
                mask[0] = True
            Z[j] = mask.astype(float)
            perturbed_prompt = " ".join(t for t, m in zip(tokens, mask) if m)

            perturbed_output = self.model_fn(perturbed_prompt)
            perturbed_emb = self.embedder.encode([perturbed_output])[0]

            # Embedding L2 distance stands in for the output-level distance
            deltas[j] = np.linalg.norm(baseline_emb - perturbed_emb)
            input_dists[j] = np.sum(1 - mask) / n  # normalized Hamming

        # Step 3: Gaussian kernel weights
        weights = np.exp(-(input_dists ** 2) / (sigma ** 2))

        # Step 4: Fit weighted linear surrogate
        reg = LinearRegression()
        reg.fit(Z, deltas, sample_weight=weights)

        # Step 5: Extract attributions (magnitude only, scaled so the max is 1)
        raw_scores = np.abs(reg.coef_)
        max_score = raw_scores.max() if raw_scores.max() > 0 else 1.0
        normalized = raw_scores / max_score

        return {
            "tokens": tokens,
            "attributions": normalized.tolist(),
            "coefficients": reg.coef_.tolist(),
            "fidelity_r2": reg.score(Z, deltas, sample_weight=weights),
            "top_tokens": sorted(
                zip(tokens, normalized.tolist()), key=lambda x: -x[1]
            )[:5],
        }
```
Do:
Set σ relative to feature dimensionality. A good starting point is σ = 0.5 * sqrt(N), where N is the token count.
Avoid:
| Problem | Symptom | Solution |
|---------|---------|----------|
| Low fidelity (R² < 0.5) | Surrogate doesn't match model behavior | Increase perturbation count, reduce dropout rate, or narrow σ |
| All attributions near-equal | Flat coefficient vector | The prompt may be redundant or the model is insensitive; try coarser perturbation (drop whole phrases) |
| API rate limits during perturbation | Timeouts or 429 errors | Implement exponential backoff and batch perturbations; reduce J |
| Unstable scores across runs | High variance in top-K tokens | Increase J to 400+, average over 3–5 independent runs |
| Degenerate outputs from heavy masking | Model returns empty or nonsensical text | Cap dropout at 20%, ensure at least 70% of tokens survive each perturbation |
| Memory issues with image embeddings | OOM on large batches | Process perturbations in chunks of 20–50, store embeddings to disk |
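For the rate-limit row, one way to add exponential backoff is to wrap the model callable before handing it to the explainer. This is a minimal sketch: `with_backoff` is a hypothetical helper, and it retries on any exception for brevity, whereas a real wrapper would more likely catch only the client library's rate-limit or timeout errors.

```python
import random
import time


def with_backoff(model_fn, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Wrap model_fn so failed calls are retried with exponential backoff."""
    def wrapped(prompt):
        for attempt in range(max_retries + 1):
            try:
                return model_fn(prompt)
            except Exception:
                if attempt == max_retries:
                    raise  # out of retries: surface the last error
                # Delay doubles each attempt; jitter spreads out retries.
                delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
                sleep(delay)
    return wrapped


# Usage with a stand-in model function (hypothetical).
def echo_model(prompt):
    return prompt.upper()

safe_model = with_backoff(echo_model, max_retries=3, base_delay=0.5)
print(safe_model("a stone bridge"))  # A STONE BRIDGE
```

The `sleep` parameter exists so the wait can be replaced in tests; in normal use the default `time.sleep` applies.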
Addressing Explainability of Generative AI using SMILE — Zeinab Dehghani, 2026. Introduces the gSMILE framework for model-agnostic explainability of generative models via contrastive perturbation, Wasserstein distance, and weighted linear surrogates. Focus on Sections 3–5 for the mathematical framework, perturbation strategies, and evaluation metrics.