skills/ai-redacting-data/SKILL.md
Strip PII and sensitive data from text before processing with AI. Use when redacting personal information, GDPR compliance, anonymizing customer data, masking credit cards, redacting PHI for HIPAA, stripping emails and phone numbers, de-identifying medical records, removing names from transcripts, PII detection and replacement, building a data anonymization pipeline, sanitizing text before sending to LLMs, pre-processing sensitive documents, privacy-preserving AI pipelines.
Strip personal information and sensitive data from text before it reaches an LM — or before it leaves your system.
Before writing code, answer three questions:
The answers drive which pipeline path you need.
import dspy
import re
from dataclasses import dataclass, field
from typing import Literal
lm = dspy.LM("openai/gpt-4o-mini") # or "anthropic/claude-sonnet-4-5-20250929", etc.
dspy.configure(lm=lm)
| Strategy | Example output | Best for |
|---|---|---|
| Category placeholder | [EMAIL], [PHONE] | Readability, compliance audits |
| Indexed placeholder | [PERSON_1], [PERSON_2] | Preserving co-references across text |
| Hash | [a3f9…] | Pseudonymization, re-linkable with key |
| Synthetic / fake | John Smith → Alex Turner | Testing pipelines with realistic-looking data |
| Blank / mask | ████████ | Display-layer redaction |
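
The hash strategy in the table is the only one that depends on a secret, so a small sketch may help. This is a minimal illustration, not part of the skill's code: `hash_placeholder`, the key value, and the 8-character truncation are all assumptions chosen for the example. It uses a keyed HMAC rather than a bare hash so that a party without the key cannot confirm a guessed value by hashing it themselves.

```python
import hashlib
import hmac


def hash_placeholder(value: str, key: bytes, length: int = 8) -> str:
    """Return a bracketed, truncated HMAC-SHA256 digest of a PII value."""
    # Normalise case and surrounding whitespace so trivial variants of the
    # same value map to the same token (an assumption; adjust to your data).
    normalised = value.strip().lower().encode("utf-8")
    digest = hmac.new(key, normalised, hashlib.sha256).hexdigest()
    return f"[{digest[:length]}]"


# Hypothetical key for illustration only; load a real key from a secret store.
key = b"hypothetical-secret-key"

token_a = hash_placeholder("jane.doe@example.com", key)
token_b = hash_placeholder("Jane.Doe@example.com ", key)
print(token_a == token_b)  # True: both variants normalise to the same input
```

Whoever holds the key can recompute the token for a known value and link records across documents. Truncating the digest keeps placeholders readable but raises the chance of collisions, so a longer `length` may be safer for large datasets.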
Regex is fast, deterministic, and never sends PII to an external API. Always run it before the LM pass.
# Patterns for structured PII
PATTERNS = {
    "EMAIL": re.compile(r'\b[A-Za-z0-9._%+\-]+@[A-Za-z0-9.\-]+\.[A-Za-z]{2,}\b'),
    "PHONE": re.compile(r'\b(\+?1[-.\s]?)?(\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4})\b'),
    "SSN": re.compile(r'\b\d{3}-\d{2}-\d{4}\b'),
    "CREDIT_CARD": re.compile(r'\b(?:\d{4}[-\s]?){3}\d{4}\b'),
    "IP_ADDRESS": re.compile(r'\b\d{1,3}(?:\.\d{1,3}){3}\b'),
    "DATE_OF_BIRTH": re.compile(r'\b(?:DOB|Date of Birth|born)[:\s]+\d{1,2}[/\-]\d{1,2}[/\-]\d{2,4}\b', re.IGNORECASE),
    "ZIP_CODE": re.compile(r'\b\d{5}(?:-\d{4})?\b'),
}


@dataclass
class PIIMatch:
    pii_type: str
    value: str
    start: int
    end: int


def regex_detect(text: str) -> list[PIIMatch]:
    matches = []
    for pii_type, pattern in PATTERNS.items():
        for m in pattern.finditer(text):
            matches.append(PIIMatch(pii_type=pii_type, value=m.group(), start=m.start(), end=m.end()))
    return matches
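The CREDIT_CARD pattern above matches any 16-digit group, including order numbers and tracking references. One common way to tell the two apart is the Luhn checksum. The sketch below is an optional add-on, not part of the skill: `luhn_valid` and `find_card_numbers` are hypothetical helper names, and the pattern is copied so the block runs on its own.

```python
import re

# Same shape as the CREDIT_CARD pattern: four groups of four digits.
CARD_PATTERN = re.compile(r'\b(?:\d{4}[-\s]?){3}\d{4}\b')


def luhn_valid(candidate: str) -> bool:
    """Return True if the digits in `candidate` pass the Luhn checksum."""
    digits = [int(ch) for ch in candidate if ch.isdigit()]
    if len(digits) < 13:
        return False
    total = 0
    # Walk from the rightmost digit; double every second digit.
    for position, digit in enumerate(reversed(digits)):
        if position % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


def find_card_numbers(text: str) -> list[str]:
    """Return pattern matches that also pass the Luhn check."""
    return [m.group() for m in CARD_PATTERN.finditer(text) if luhn_valid(m.group())]


sample = "Card 4111 1111 1111 1111 on file; order ref 1234 5678 9012 3456."
print(find_card_numbers(sample))  # ['4111 1111 1111 1111']
```

For redaction, a failed checksum does not prove the digits are harmless, since a mistyped card number is still sensitive. It may be safer to use the check to label confidence or pick the placeholder type than to skip masking.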
Use the LM only for PII that requires reading context — names, addresses, and other free-form entities.
class DetectContextualPII(dspy.Signature):
    """Identify personal information in text that requires context to detect.

    Return a JSON list of objects with fields: pii_type, value.
    PII types to detect - PERSON_NAME, ADDRESS, MEDICAL_RECORD_NUMBER, ORG_NAME (when linked to a person).
    Do not flag generic words that happen to resemble names."""

    text: str = dspy.InputField(desc="Text to scan for personal information")
    pii_entities: list[dict] = dspy.OutputField(
        desc='JSON list - [{"pii_type": "PERSON_NAME", "value": "Jane Doe"}, ...]'
    )


detect_pii = dspy.Predict(DetectContextualPII)
class PIIRedactor(dspy.Module):
    def __init__(self, strategy: Literal["placeholder", "indexed", "blank"] = "placeholder"):
        super().__init__()
        self.strategy = strategy
        self.detect = dspy.Predict(DetectContextualPII)

    def _make_replacement(self, pii_type: str, entity_index: dict) -> str:
        if self.strategy == "indexed":
            # Per-type counter, so the first phone is [PHONE_1], the second [PHONE_2], ...
            n = entity_index.get(pii_type, 0) + 1
            entity_index[pii_type] = n
            return f"[{pii_type}_{n}]"
        elif self.strategy == "blank":
            return "████"
        else:
            return f"[{pii_type}]"

    def forward(self, text: str) -> dspy.Prediction:
        entity_index: dict[str, int] = {}
        seen: dict[str, str] = {}  # value → replacement (for consistency)

        # Pass 1 - regex for structured patterns
        regex_hits = regex_detect(text)

        # Pass 2 - LM for contextual PII (only send text with structured PII pre-masked).
        # Replace right-to-left so earlier offsets stay valid.
        pre_masked = text
        for hit in sorted(regex_hits, key=lambda h: h.start, reverse=True):
            pre_masked = pre_masked[:hit.start] + f"[{hit.pii_type}]" + pre_masked[hit.end:]
        lm_result = self.detect(text=pre_masked)
        lm_entities = lm_result.pii_entities or []

        # Build replacement map from LM entities
        for entity in lm_entities:
            val = entity.get("value", "")
            pii_type = entity.get("pii_type", "PII")
            if val and val not in seen:
                seen[val] = self._make_replacement(pii_type, entity_index)

        # Add regex hits to the same map, in reading order so indexed
        # placeholders are numbered by first appearance
        for hit in sorted(regex_hits, key=lambda h: h.start):
            if hit.value not in seen:
                seen[hit.value] = self._make_replacement(hit.pii_type, entity_index)

        # Apply every replacement to the original text in a single pass,
        # longest values first so a short value cannot clobber part of a longer one
        final = text
        for val, replacement in sorted(seen.items(), key=lambda kv: len(kv[0]), reverse=True):
            final = final.replace(val, replacement)

        return dspy.Prediction(
            redacted_text=final,
            entities_found=seen,
        )
Do not use dspy.Assert or dspy.Suggest here — they are deprecated. Use dspy.Refine with a reward function.
class ValidateRedaction(dspy.Signature):
    """Check whether any PII survived redaction. Return True if clean, False if PII remains."""

    original_text: str = dspy.InputField()
    redacted_text: str = dspy.InputField()
    is_clean: bool = dspy.OutputField(desc="True if no PII remains, False otherwise")
    leaked_examples: list[str] = dspy.OutputField(desc="Examples of PII that leaked through, empty list if clean")


def redaction_reward(args, pred) -> float:
    # dspy.Refine calls reward_fn(args, pred), where args is the dict of keyword
    # arguments the wrapped module was called with, e.g. {"text": ...}
    validator = dspy.Predict(ValidateRedaction)
    result = validator(
        original_text=args["text"],
        redacted_text=pred.redacted_text,
    )
    return 1.0 if result.is_clean else 0.0
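A cheap complement to the LM validator above is to re-run the structured patterns on the redacted output and treat any match as a leak. The sketch below assumes a two-pattern subset and hypothetical helper names (`residual_pii`, `regex_reward`). In practice the full PATTERNS dict would be reused, and this check could run first so the LM validator is only called when the regex pass comes back clean.

```python
import re

# Illustrative subset of the structured patterns; reuse the full set in practice.
RESIDUAL_PATTERNS = {
    "EMAIL": re.compile(r'\b[A-Za-z0-9._%+\-]+@[A-Za-z0-9.\-]+\.[A-Za-z]{2,}\b'),
    "SSN": re.compile(r'\b\d{3}-\d{2}-\d{4}\b'),
}


def residual_pii(redacted_text: str) -> dict[str, list[str]]:
    """Return structured PII that still matches after redaction, keyed by type."""
    leaks: dict[str, list[str]] = {}
    for pii_type, pattern in RESIDUAL_PATTERNS.items():
        found = pattern.findall(redacted_text)
        if found:
            leaks[pii_type] = found
    return leaks


def regex_reward(redacted_text: str) -> float:
    """Score 1.0 when no structured pattern matches the redacted text, else 0.0."""
    return 0.0 if residual_pii(redacted_text) else 1.0


print(regex_reward("Contact [EMAIL] about claim [SSN]"))  # 1.0
print(residual_pii("SSN 123-45-6789 leaked"))             # {'SSN': ['123-45-6789']}
```

Because this check is deterministic and local, it never sends the original text anywhere, which fits the goal of keeping raw PII away from external APIs.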
GDPR - Right to erasure
# Store the entity map so you can reverse-map or fully erase later
redactor = PIIRedactor(strategy="indexed")
result = redactor(text=document)
# Persist result.entities_found keyed by document ID
# On erasure request - delete the mapping; the redacted text can then no longer be re-linked to the person
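The persist-then-erase flow described in the comments above can be sketched with an in-memory stand-in. `MappingStore` and its methods are hypothetical names for illustration. A real deployment would presumably use an encrypted, access-controlled store, because the mapping itself contains the original PII. Reversal only works with the indexed strategy, where each value has its own placeholder.

```python
class MappingStore:
    """In-memory stand-in for a per-document entity-map store (hypothetical)."""

    def __init__(self) -> None:
        self._maps: dict[str, dict[str, str]] = {}

    def save(self, doc_id: str, entities_found: dict[str, str]) -> None:
        # entities_found maps original value -> placeholder
        self._maps[doc_id] = dict(entities_found)

    def restore(self, doc_id: str, redacted_text: str) -> str:
        """Re-insert original values; raises KeyError once the mapping is erased."""
        mapping = self._maps[doc_id]
        restored = redacted_text
        for value, placeholder in mapping.items():
            restored = restored.replace(placeholder, value)
        return restored

    def erase(self, doc_id: str) -> bool:
        """Delete the mapping; return True if one existed."""
        return self._maps.pop(doc_id, None) is not None


store = MappingStore()
store.save("doc-1", {"Jane Doe": "[PERSON_NAME_1]", "555-123-4567": "[PHONE_1]"})
redacted = "Call [PERSON_NAME_1] at [PHONE_1]"

print(store.restore("doc-1", redacted))  # Call Jane Doe at 555-123-4567
print(store.erase("doc-1"))              # True
```

After `erase`, the redacted text is all that remains, and `restore` fails because there is nothing left to link the placeholders back to.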
HIPAA - Safe Harbor de-identification
HIPAA Safe Harbor requires removing 18 PHI identifiers. Add these patterns:
HIPAA_PATTERNS = {
    "MRN": re.compile(r'\bMRN[:\s#]+\w+\b', re.IGNORECASE),
    "NPI": re.compile(r'\bNPI[:\s#]+\d{10}\b', re.IGNORECASE),
    "DEVICE_ID": re.compile(r'\b(?:device|serial)[:\s#]+[A-Z0-9\-]{6,}\b', re.IGNORECASE),
    "URL": re.compile(r'https?://\S+'),
    "ACCOUNT": re.compile(r'\baccount[:\s#]+\w+\b', re.IGNORECASE),
}
PATTERNS.update(HIPAA_PATTERNS)
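Safe Harbor also covers date elements more specific than the year for dates tied to an individual. These are usually generalized rather than masked outright. The sketch below is one possible approach, not part of the skill: `FULL_DATE` and `generalize_dates` are hypothetical names. It handles only numeric dates with a four-digit year and does not address the separate rule for ages over 89.

```python
import re

# Numeric dates such as 03/14/2021 or 3-14-2021; the year is captured.
FULL_DATE = re.compile(r'\b\d{1,2}[/\-]\d{1,2}[/\-](\d{4})\b')


def generalize_dates(text: str) -> str:
    """Replace each full numeric date with its year only."""
    return FULL_DATE.sub(lambda m: m.group(1), text)


print(generalize_dates("Admitted 03/14/2021, discharged 3-20-2021."))
# Admitted 2021, discharged 2021.
```

Written-out dates ("March 14, 2021") and two-digit years would need additional patterns or the LM pass.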
# Quick usage
redactor = PIIRedactor(strategy="indexed")
result = redactor(text="Call Jane Doe at 555-123-4567 or [email protected]")
print(result.redacted_text)
# "Call [PERSON_NAME_1] at [PHONE_1] or [EMAIL_1]"
print(result.entities_found)
# {"Jane Doe": "[PERSON_NAME_1]", "555-123-4567": "[PHONE_1]", "[email protected]": "[EMAIL_1]"}
The LM sees the PII you are trying to hide - sending raw text to an external LM for detection defeats the purpose if the PII itself is sensitive. Run regex first and send only the pre-masked text to the LM, or use a locally hosted model.
Common words misidentified as names - Claude flags "Will" (a verb), "Mark" (a noun), "Faith" (a concept) as PERSON_NAME. Prompt the signature to exclude words that are clearly not names in context, and validate detections against a stoplist.
Inconsistent placeholders break co-reference - without a seen mapping dict, the same person can appear as [PERSON_1] in paragraph 1 and [PERSON_2] in paragraph 3. Always deduplicate entity values before assigning replacements.
Non-English and transliterated names are missed - Claude's contextual PII detection is weakest on names from languages with different romanization conventions (e.g., Chinese pinyin, Arabic transliteration). Add language-specific name lists or a multilingual NER model for those cases.
Using dspy.Assert/dspy.Suggest for validation is outdated - those APIs are removed in DSPy 2.5+. Use dspy.Refine with a reward function as shown in Step 7.
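Two of the points above, stoplist validation and consistent placeholders, can be sketched without any LM call. The helper names and the three-word stoplist are illustrative assumptions. The entity format matches the `{"pii_type": ..., "value": ...}` dicts the detection signature asks for.

```python
# Illustrative stoplist; a real one would be larger and domain-specific.
NAME_STOPLIST = {"will", "mark", "faith"}


def drop_stoplisted_names(entities: list[dict]) -> list[dict]:
    """Drop PERSON_NAME detections whose whole value is a stoplisted word."""
    kept = []
    for entity in entities:
        value = entity.get("value", "")
        is_stopword = (
            entity.get("pii_type") == "PERSON_NAME"
            and value.strip().lower() in NAME_STOPLIST
        )
        if value and not is_stopword:
            kept.append(entity)
    return kept


def assign_indexed_placeholders(entities: list[dict]) -> dict[str, str]:
    """Map each distinct value to one placeholder, numbered per type."""
    counters: dict[str, int] = {}
    seen: dict[str, str] = {}
    for entity in entities:
        value, pii_type = entity["value"], entity["pii_type"]
        if value in seen:
            continue  # a repeated mention reuses the existing placeholder
        counters[pii_type] = counters.get(pii_type, 0) + 1
        seen[value] = f"[{pii_type}_{counters[pii_type]}]"
    return seen


detections = [
    {"pii_type": "PERSON_NAME", "value": "Jane Doe"},
    {"pii_type": "PERSON_NAME", "value": "Will"},
    {"pii_type": "PERSON_NAME", "value": "Jane Doe"},
    {"pii_type": "PERSON_NAME", "value": "Will Smith"},
]
print(assign_indexed_placeholders(drop_stoplisted_names(detections)))
# {'Jane Doe': '[PERSON_NAME_1]', 'Will Smith': '[PERSON_NAME_2]'}
```

The stoplist compares the whole detected value, so "Will Smith" survives while a bare "Will" is dropped. A bare "Will" can still be a real name, so routing such cases to review rather than discarding them may be the safer default.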
Install any skill:
npx skills add lebsral/DSPy-Programming-not-prompting-LMs-skills --skill <name>
- /ai-parsing-data - extract structured fields from text (complementary pattern)
- /ai-checking-outputs - validate that outputs meet quality criteria
- /dspy-refine - iterative refinement with a reward function for validation loops
- /dspy-retrieval - if you need to redact before indexing documents
- /ai-do if you do not have it. It routes any AI problem to the right skill and is the fastest way to work: npx skills add lebsral/DSPy-Programming-not-prompting-LMs-skills --skill ai-do

See examples.md for worked examples.