skills/benchmarking-zero-shot-few-shot-phishing/SKILL.md
Detect phishing URLs using LLM zero-shot and few-shot prompting with structured classification prompts. Use when: 'classify this URL as phishing or legitimate', 'analyze URLs for phishing', 'build a phishing detection prompt', 'detect suspicious URLs with few-shot examples', 'benchmark phishing detection accuracy', 'zero-shot URL security classification'.
This skill enables Claude to classify URLs as phishing or legitimate using the unified zero-shot and few-shot prompting framework from Hasan & BusiReddyGari (2025). Rather than relying on feature-engineered ML pipelines or blocklists, this approach passes raw URL strings directly to an LLM with a structured prompt comprising an instruction, optional balanced examples, and a query — achieving up to 94% accuracy with just six few-shot examples. The technique is especially valuable when labeled training data is scarce or when new phishing campaigns emerge faster than traditional models can retrain.
The paper's core insight is that LLMs can perform phishing URL detection through prompt-only inference — no manual feature extraction, no fine-tuning, no URL parsing into domain age, path length, or entropy features. The URL string itself contains enough lexical and structural signal (misspelled brand names, excessive subdomains, suspicious TLDs, encoded characters, abnormal path depth) for an LLM to classify it when given the right prompt structure.
The unified prompting framework concatenates three components: an Instruction (I) that assigns a cybersecurity expert role and constrains output to binary labels, optional Examples (E) that provide balanced URL-label demonstrations, and a Query (Q) containing the target URL. Temperature is set to 0 for deterministic output, and max tokens is capped at 10 to force concise classification. Few-shot examples are balanced (equal phishing and legitimate) and disjoint from the evaluation set.
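The concatenation of Instruction, Examples, and Query can be sketched in a few lines. This is a minimal illustration rather than the paper's reference code: the function name `build_prompt`, the `LABEL_NAMES` mapping, and the placeholder URLs are assumptions, while the instruction and query wording are taken from the workflow steps in this document.

```python
from typing import Sequence, Tuple

# Instruction (I): role assignment plus the binary output constraint.
INSTRUCTION = ("You are a cybersecurity expert. "
               "Respond only with 0 for phishing or 1 for legitimate.")
LABEL_NAMES = {0: "phishing", 1: "legitimate"}


def build_prompt(target_url: str,
                 examples: Sequence[Tuple[str, int]] = ()) -> Tuple[str, str]:
    """Return (instruction, user_message).

    `examples` holds (url, label) pairs; leaving it empty yields the
    zero-shot prompt, so both settings share one code path.
    """
    # Examples (E): one "URL ... Answer ..." line per demonstration.
    example_lines = [f"URL: {url} Answer: {label} ({LABEL_NAMES[label]})"
                     for url, label in examples]
    # Query (Q): the raw target URL, passed through unmodified.
    query = (f"URL: {target_url} "
             "Is this URL phishing or legitimate? Respond with 0 or 1.")
    return INSTRUCTION, "\n".join(example_lines + [query])


if __name__ == "__main__":
    # Placeholder URLs, for illustration only.
    _, message = build_prompt(
        "http://login.example.test/verify",
        examples=[("http://bad.example.test/signin", 0),
                  ("https://good.example.test/docs", 1)],
    )
    print(message)
```

Keeping the instruction separate from the user message lets the caller pass it through whatever system-prompt mechanism the chosen model API provides.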
The critical finding is that few-shot prompting with six balanced examples (3 phishing, 3 legitimate) substantially improves performance across all tested LLMs, boosting F1 by 2-4 percentage points over zero-shot. However, the relationship between example count and performance is model-dependent: some models peak with just one example while others improve steadily up to nine. This means the optimal few-shot configuration must be empirically validated per model.
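Because the best example count differs between models, one way to validate it empirically is a small sweep over shot counts on a held-out set. The sketch below assumes a hypothetical `classify(url, examples)` callable that wraps the model call and returns 0 or 1. The names `macro_f1` and `sweep_shot_counts`, and the choice to sweep balanced per-class counts, are illustrative assumptions and not part of the paper.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Example = Tuple[str, int]  # (url, label) with 0 = phishing, 1 = legitimate


def macro_f1(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Unweighted mean of the per-class F1 scores for labels 0 and 1."""
    scores = []
    for cls in (0, 1):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def sweep_shot_counts(
    classify: Callable[[str, List[Example]], int],  # hypothetical model wrapper
    phishing_pool: Sequence[str],
    legit_pool: Sequence[str],
    eval_set: Sequence[Example],
    per_class_counts: Sequence[int] = (0, 1, 2, 3),
) -> Dict[int, float]:
    """Map total example count -> macro F1 on eval_set.

    The pools must be disjoint from eval_set; k = 0 is the zero-shot baseline.
    """
    y_true = [label for _, label in eval_set]
    results: Dict[int, float] = {}
    for k in per_class_counts:
        examples = ([(u, 0) for u in phishing_pool[:k]]
                    + [(u, 1) for u in legit_pool[:k]])
        y_pred = [classify(url, examples) for url, _ in eval_set]
        results[2 * k] = macro_f1(y_true, y_pred)
    return results
```

The configuration with the highest score on the held-out set can then be fixed before running the final evaluation, so the reported numbers are not tuned on the test URLs.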
Collect target URLs — Gather the URL(s) to classify. Preserve the full URL string including protocol, subdomains, path, query parameters, and fragments. Do not normalize, truncate, or strip components.
Construct the system instruction — Use a role-constrained prompt that assigns cybersecurity expertise and forces binary output:
You are a cybersecurity expert. Respond only with 0 for phishing or 1 for legitimate.
Select few-shot examples (if available) — Choose 6 balanced examples: 3 known phishing URLs and 3 known legitimate URLs. Ensure examples are representative of common phishing patterns (typosquatting, subdomain abuse, path mimicry) and common legitimate patterns (well-known domains, standard paths). Keep examples disjoint from any URLs being classified.
Format the few-shot examples block — Structure each example as a URL-label pair with descriptive labels:
URL: http://secure-bankofamerica.com.verify.xyz/login Answer: 0 (phishing)
URL: https://www.google.com/search?q=weather Answer: 1 (legitimate)
URL: http://paypa1-verify.com/update-info Answer: 0 (phishing)
URL: https://github.com/anthropics/claude-code Answer: 1 (legitimate)
URL: http://microsoft-365.account-verify.ru/signin Answer: 0 (phishing)
URL: https://stackoverflow.com/questions/12345 Answer: 1 (legitimate)
Construct the query — Format the target URL for classification:
URL: {target_url} Is this URL phishing or legitimate? Respond with 0 or 1.
Set inference parameters — Use temperature=0 for deterministic classification and limit max output tokens to 10 to prevent verbose explanations that complicate parsing.
Parse the response — Extract the binary label (0 or 1) from the model output. If the response is unparseable (contains neither 0 nor 1 as a standalone token), flag it for manual review rather than guessing.
Compute evaluation metrics (for batch classification) — When classifying a dataset, compute accuracy, precision, recall, F1 (macro-averaged), AUROC, and AUPRC using scikit-learn. Report all six metrics — accuracy alone is misleading on imbalanced data.
Analyze precision-recall trade-offs — Few-shot prompting typically increases precision (fewer false positives) at a slight cost to recall (more missed phishing). Decide which trade-off fits the deployment context: high recall for user-facing filters, high precision for automated blocking.
Iterate on example selection — If performance is below target, experiment with different few-shot examples. Prioritize examples that represent the phishing patterns most common in your target domain (e.g., financial brand impersonation, OAuth phishing, URL shortener abuse).
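The parsing and trade-off steps above can be sketched without any external dependencies. This is a minimal illustration under stated assumptions: the helper names `parse_label` and `phishing_precision_recall` are hypothetical, and treating a response that contains both labels as unparseable is a design choice made here, not something the paper specifies.

```python
import re
from typing import Optional, Sequence, Tuple

# Matches 0 or 1 only as a standalone token, so "10" or "2021" do not count.
_LABEL_TOKEN = re.compile(r"\b[01]\b")


def parse_label(text: str) -> Optional[int]:
    """Return 0 or 1 if exactly one distinct label appears, else None."""
    found = set(_LABEL_TOKEN.findall(text))
    return int(found.pop()) if len(found) == 1 else None


def phishing_precision_recall(
    y_true: Sequence[int],
    y_pred: Sequence[Optional[int]],
    phishing_label: int = 0,
) -> Tuple[float, float]:
    """Precision and recall with phishing as the positive class.

    Unparseable predictions (None) are skipped; report their count separately.
    """
    pairs = [(t, p) for t, p in zip(y_true, y_pred) if p is not None]
    tp = sum(1 for t, p in pairs if t == phishing_label and p == phishing_label)
    fp = sum(1 for t, p in pairs if t != phishing_label and p == phishing_label)
    fn = sum(1 for t, p in pairs if t == phishing_label and p != phishing_label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

Computing the phishing-class values explicitly matters because this skill encodes phishing as 0, whereas scikit-learn's binary precision and recall default to `pos_label=1`, which here is the legitimate class.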
Example 1: Zero-shot single URL classification
User: "Is this URL phishing? http://amaz0n-prime.account-update.xyz/verify?id=8823"
Approach:
Output:
Classification: 0 (phishing)
Indicators detected:
- Brand impersonation: "amaz0n" uses character substitution (0 for o)
- Suspicious domain: "account-update.xyz" is not an Amazon-owned domain
- Deceptive path: "/verify" mimics a legitimate account verification flow
- Query parameter: "id=8823" may indicate per-victim tracking by a phishing kit, though on its own this is weak evidence
Example 2: Few-shot batch classification with evaluation
User: "I have a CSV of 500 URLs. Build me a phishing detection pipeline using few-shot prompting and report accuracy metrics."
Approach:
Output:
```python
import re

import anthropic  # assumption: the original snippet used `client` without defining it
import pandas as pd
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, average_precision_score)

# Assumed client setup; reads ANTHROPIC_API_KEY from the environment
client = anthropic.Anthropic()

# Load data
df = pd.read_csv("urls.csv")  # columns: url, label (0=phishing, 1=legitimate)

# Hold out 6 examples for few-shot (3 per class), disjoint from eval set
examples_phishing = df[df.label == 0].sample(3, random_state=42)
examples_legit = df[df.label == 1].sample(3, random_state=42)
few_shot_ids = examples_phishing.index.union(examples_legit.index)
eval_df = df.drop(index=few_shot_ids).copy()

SYSTEM_PROMPT = ("You are a cybersecurity expert. "
                 "Respond only with 0 for phishing or 1 for legitimate.")

# Matches 0 or 1 only as a standalone token (so "10" is not read as a label)
LABEL_RE = re.compile(r"\b[01]\b")


def build_few_shot_block(phishing_examples, legit_examples):
    lines = []
    for _, row in phishing_examples.iterrows():
        lines.append(f"URL: {row.url} Answer: 0 (phishing)")
    for _, row in legit_examples.iterrows():
        lines.append(f"URL: {row.url} Answer: 1 (legitimate)")
    return "\n".join(lines)


def classify_url(url, few_shot_block, client):
    user_content = (f"{few_shot_block}\n\nURL: {url}\n"
                    "Is this URL phishing or legitimate? Respond with 0 or 1.")
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=10,
        temperature=0,
        system=SYSTEM_PROMPT,  # the Messages API takes the system prompt as a top-level argument
        messages=[{"role": "user", "content": user_content}],
    )
    text = response.content[0].text.strip()
    found = set(LABEL_RE.findall(text))
    if len(found) == 1:
        return int(found.pop())
    return None  # unparseable or ambiguous: flag for manual review


few_shot_block = build_few_shot_block(examples_phishing, examples_legit)

# Classify all evaluation URLs
predictions = []
for _, row in eval_df.iterrows():
    pred = classify_url(row.url, few_shot_block, client)
    predictions.append(pred)
eval_df["pred"] = predictions

# Drop unparseable rows and restore integer labels (None becomes NaN in pandas)
valid = eval_df.dropna(subset=["pred"])
y_true = valid.label.astype(int)
y_pred = valid.pred.astype(int)

# Report metrics
print(f"Accuracy: {accuracy_score(y_true, y_pred):.4f}")
print(f"Precision: {precision_score(y_true, y_pred, average='macro'):.4f}")
print(f"Recall: {recall_score(y_true, y_pred, average='macro'):.4f}")
print(f"F1: {f1_score(y_true, y_pred, average='macro'):.4f}")
# AUROC and AUPRC are computed from hard 0/1 predictions here, so each reflects a
# single operating point; AUROC then equals balanced accuracy.
print(f"AUROC: {roc_auc_score(y_true, y_pred):.4f}")
print(f"AUPRC: {average_precision_score(y_true, y_pred):.4f}")
print(f"Unparseable: {eval_df.pred.isna().sum()} / {len(eval_df)}")
```
Example 3: Comparing zero-shot vs few-shot for a custom URL list
User: "Compare zero-shot and few-shot phishing detection on my URL list and show which is better."
Approach:
Output:
| Metric | Zero-shot | Few-shot (6 examples) | Delta |
|-----------|-----------|-----------------------|--------|
| Accuracy | 0.8760 | 0.9250 | +0.049 |
| Precision | 0.8780 | 0.9030 | +0.025 |
| Recall | 0.8730 | 0.9530 | +0.080 |
| F1 | 0.8760 | 0.9270 | +0.051 |
| AUROC | 0.8760 | 0.9250 | +0.049 |
| AUPRC | 0.9070 | 0.9400 | +0.033 |
Recommendation: On this URL list, few-shot prompting with 6 balanced examples
improves all metrics. The largest gain is in recall (+0.080, or 8 percentage
points), meaning fewer phishing URLs are missed. Use few-shot for production
deployments.
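A comparison table like the one above can be rendered from two metric dictionaries. The sketch below is illustrative only: `delta_table` is a hypothetical helper, and it assumes both dictionaries use the same metric names and were computed on the same evaluation set.

```python
from typing import Dict


def delta_table(zero_shot: Dict[str, float], few_shot: Dict[str, float]) -> str:
    """Render a Markdown table of zero-shot vs few-shot metrics with deltas."""
    rows = ["| Metric | Zero-shot | Few-shot | Delta |",
            "|--------|-----------|----------|-------|"]
    for name, base in zero_shot.items():
        improved = few_shot[name]
        # Deltas are absolute differences in the metric, not relative percentages.
        rows.append(f"| {name} | {base:.4f} | {improved:.4f} | {improved - base:+.3f} |")
    return "\n".join(rows)


if __name__ == "__main__":
    print(delta_table({"Accuracy": 0.8760, "Recall": 0.8730},
                      {"Accuracy": 0.9250, "Recall": 0.9530}))
```

Reporting the delta as an absolute difference keeps it unambiguous: a change from 0.873 to 0.953 is 8 percentage points, not an 8% relative improvement.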
Hasan, N. & BusiReddyGari, P. (2025). Benchmarking Large Language Models for Zero-shot and Few-shot Phishing URL Detection. arXiv:2602.02641. NeurIPS 2025 LAW Workshop. https://arxiv.org/abs/2602.02641v1 — See Tables 1-3 for per-model performance under balanced and imbalanced settings, and Section 4 for the unified prompting framework specification.