external/anthropic-cybersecurity-skills/skills/detecting-deepfake-audio-in-vishing-attacks/SKILL.md
Detects AI-generated deepfake audio used in voice phishing (vishing) attacks by extracting spectral features (MFCC, spectral centroid, spectral contrast, zero-crossing rate) and classifying samples with machine learning models. Supports batch analysis of audio files, generates confidence scores, and produces forensic reports. Activates for requests involving deepfake voice detection, vishing investigation, AI-generated speech analysis, voice cloning detection, or audio authenticity verification.
npx skillsauth add seikaikyo/dash-skills detecting-deepfake-audio-in-vishing-attacks
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
Do not use for text-based phishing (email/SMS); use email header analysis or URL detonation tools instead.
Normalize and prepare audio samples for feature extraction:
import librosa
import numpy as np
# Load audio, resample to 16kHz mono
y, sr = librosa.load("suspect_call.wav", sr=16000, mono=True)
# Trim silence from beginning and end
y_trimmed, _ = librosa.effects.trim(y, top_db=25)
# Normalize amplitude to [-1, 1]
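# Guard against division by zero when the clip is entirely silent
peak = np.max(np.abs(y_trimmed)) if y_trimmed.size else 0.0
y_norm = y_trimmed / peak if peak > 0 else y_trimmed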
Audio preprocessing ensures consistent feature extraction across different recording conditions, microphones, and codec artifacts.
Extract the feature set that distinguishes real from synthetic speech:
Mel-Frequency Cepstral Coefficients (MFCCs):
# Extract 20 MFCCs + delta and delta-delta
mfccs = librosa.feature.mfcc(y=y_norm, sr=sr, n_mfcc=20)
mfcc_delta = librosa.feature.delta(mfccs)
mfcc_delta2 = librosa.feature.delta(mfccs, order=2)
MFCCs capture the spectral envelope of speech, representing how the vocal tract shapes sound. Deepfake audio often shows unnatural smoothness in higher-order MFCCs because neural vocoders approximate but do not perfectly replicate the acoustic resonance of a physical vocal tract.
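One way to put a number on the "unnatural smoothness" described above is to average the variance over time of the higher-order coefficients and compare it with a baseline from known-genuine recordings. The following is a minimal sketch, not the skill's own method: the function names, the choice of coefficient 13 onward (`first_high=12`, zero-based), and the random stand-in matrices are assumptions for illustration. In practice the input would be the `(n_mfcc, n_frames)` array returned by `librosa.feature.mfcc`.

```python
import numpy as np

def high_order_mfcc_variance(mfccs, first_high=12):
    """Mean over-time variance of coefficients first_high..end (zero-based rows)."""
    mfccs = np.asarray(mfccs, dtype=float)
    if mfccs.ndim != 2 or mfccs.shape[0] <= first_high:
        raise ValueError("expected an (n_mfcc, n_frames) array with n_mfcc > first_high")
    return float(np.mean(np.var(mfccs[first_high:], axis=1)))

def relative_deficit(sample_value, baseline_value):
    """Fraction by which the sample falls below the baseline (positive = smoother)."""
    return 1.0 - sample_value / baseline_value

# Stand-in data (assumption): random matrices in place of real MFCCs.
rng = np.random.default_rng(0)
genuine_like = rng.normal(0.0, 1.0, size=(20, 200))  # larger frame-to-frame variation
smooth_like = rng.normal(0.0, 0.5, size=(20, 200))   # damped variation

deficit = relative_deficit(
    high_order_mfcc_variance(smooth_like),
    high_order_mfcc_variance(genuine_like),
)
print(f"High-order MFCC variance is {deficit:.0%} below the baseline")
```

The baseline value is only meaningful if it comes from recordings captured under similar channel and codec conditions as the suspect call.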
Spectral Features:
spectral_centroid = librosa.feature.spectral_centroid(y=y_norm, sr=sr)
spectral_bandwidth = librosa.feature.spectral_bandwidth(y=y_norm, sr=sr)
spectral_contrast = librosa.feature.spectral_contrast(y=y_norm, sr=sr)
spectral_rolloff = librosa.feature.spectral_rolloff(y=y_norm, sr=sr)
zero_crossing_rate = librosa.feature.zero_crossing_rate(y_norm)
Key indicators of deepfake audio:
Aggregate frame-level features into a fixed-length vector and classify:
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score
def build_feature_vector(y, sr):
    features = []
    # Four summary statistics per MFCC coefficient
    mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)
    for coeff in mfccs:
        features.extend([np.mean(coeff), np.std(coeff), np.min(coeff), np.max(coeff)])
    # Four summary statistics per single-row spectral feature
    for feat_fn in [librosa.feature.spectral_centroid,
                    librosa.feature.spectral_bandwidth,
                    librosa.feature.spectral_rolloff,
                    librosa.feature.zero_crossing_rate]:
        # zero_crossing_rate does not take a sample rate argument
        if feat_fn is librosa.feature.zero_crossing_rate:
            feat = feat_fn(y)
        else:
            feat = feat_fn(y=y, sr=sr)
        features.extend([np.mean(feat), np.std(feat), np.min(feat), np.max(feat)])
    # Mean and standard deviation per spectral contrast band
    contrast = librosa.feature.spectral_contrast(y=y, sr=sr)
    for band in contrast:
        features.extend([np.mean(band), np.std(band)])
    return np.array(features)
Classification uses an ensemble approach: Random Forest for robustness and Gradient Boosting for accuracy, with a voting mechanism to reduce false positives.
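The ensemble described above could be assembled with scikit-learn's `VotingClassifier`. This is a sketch under stated assumptions rather than the skill's actual training code: soft voting (averaging class probabilities) is assumed because the source does not say which voting rule is used, the hyperparameters are arbitrary, and the data is synthetic. Each row stands in for one output of `build_feature_vector`, with label 0 for genuine and 1 for deepfake.

```python
import numpy as np
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier, VotingClassifier)
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (assumption): real rows would come from labeled audio.
rng = np.random.default_rng(42)
n_per_class, n_features = 60, 110  # feature length is an assumed value
X_genuine = rng.normal(0.0, 1.0, size=(n_per_class, n_features))
X_fake = rng.normal(0.6, 1.0, size=(n_per_class, n_features))
X = np.vstack([X_genuine, X_fake])
labels = np.array([0] * n_per_class + [1] * n_per_class)

ensemble = VotingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
        ("gbt", GradientBoostingClassifier(n_estimators=50, random_state=0)),
    ],
    voting="soft",  # average the two models' predicted probabilities
)

# 5-fold cross-validated accuracy, then a final fit on all data
scores = cross_val_score(ensemble, X, labels, cv=5)
ensemble.fit(X, labels)

# Probability that the last sample belongs to class 1 (deepfake)
proba_fake = ensemble.predict_proba(X[-1:])[0, 1]
print(f"CV accuracy: {scores.mean():.2f}, P(deepfake) for last sample: {proba_fake:.2f}")
```

Accuracy on this toy data says nothing about real-world performance; evaluation should use held-out recordings from synthesis systems and speakers not seen during training.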
Examine time-domain artifacts that neural vocoders leave behind:
# Pitch stability analysis - deepfakes often have unnaturally stable F0
f0, voiced_flag, voiced_probs = librosa.pyin(y_norm, fmin=50, fmax=500, sr=sr)
f0_clean = f0[~np.isnan(f0)]
pitch_std = np.std(f0_clean) if len(f0_clean) > 0 else 0
pitch_jitter = np.mean(np.abs(np.diff(f0_clean))) if len(f0_clean) > 1 else 0
Real human speech exhibits natural pitch jitter (micro-variations in fundamental frequency) and shimmer (amplitude perturbations). Deepfake audio generated by Tacotron 2, VALL-E, or ElevenLabs typically shows reduced jitter and shimmer compared to genuine speech.
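The snippet above measures pitch variation but not shimmer. A rough amplitude-perturbation measure can be sketched with NumPy alone, assuming frame-level RMS is an acceptable stand-in: true shimmer is defined cycle to cycle, so this frame-to-frame proxy is an approximation, and the function names and frame sizes (25 ms frames with a 10 ms hop at 16 kHz) are illustrative choices, not part of the skill.

```python
import numpy as np

def frame_rms(y, frame_length=400, hop_length=160):
    """RMS amplitude of each frame of the signal."""
    y = np.asarray(y, dtype=float)
    if len(y) < frame_length:
        raise ValueError("signal shorter than one frame")
    n_frames = 1 + (len(y) - frame_length) // hop_length
    return np.array([
        np.sqrt(np.mean(y[i * hop_length:i * hop_length + frame_length] ** 2))
        for i in range(n_frames)
    ])

def shimmer_proxy(y, frame_length=400, hop_length=160):
    """Mean absolute frame-to-frame RMS change, relative to the mean RMS."""
    rms = frame_rms(y, frame_length, hop_length)
    rms = rms[rms > 0]  # drop fully silent frames
    if len(rms) < 2:
        return 0.0
    return float(np.mean(np.abs(np.diff(rms))) / np.mean(rms))

# Synthetic check (assumption): a steady tone vs. the same tone with random gain changes.
sr = 16000
t = np.arange(sr) / sr
steady = np.sin(2 * np.pi * 150 * t)
rng = np.random.default_rng(1)
gain = 1.0 + 0.1 * np.repeat(rng.normal(size=sr // 160), 160)
perturbed = steady * gain

shimmer_steady = shimmer_proxy(steady)
shimmer_perturbed = shimmer_proxy(perturbed)
print(f"steady: {shimmer_steady:.4f}, perturbed: {shimmer_perturbed:.4f}")
```

For real recordings the measure should be restricted to voiced frames, for example by using the `voiced_flag` array from `librosa.pyin`, since pauses and consonants otherwise dominate the amplitude changes.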
Generate spectrograms for manual forensic review:
import librosa.display
import matplotlib.pyplot as plt
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
mel_db = librosa.power_to_db(librosa.feature.melspectrogram(y=y_norm, sr=sr))
librosa.display.specshow(mel_db, sr=sr, ax=axes[0, 0], x_axis='time', y_axis='mel')
axes[0, 0].set_title('Mel Spectrogram')
librosa.display.specshow(mfccs, sr=sr, ax=axes[0, 1], x_axis='time')
axes[0, 1].set_title('MFCCs')
Visual inspection reveals banding artifacts in mel spectrograms, unnatural energy cutoffs above the vocoder's frequency ceiling, and periodic noise patterns in the high-frequency range that are characteristic of neural speech synthesis.
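The energy-cutoff observation can also be checked numerically by measuring what fraction of the signal's power lies above a chosen frequency. The sketch below uses only NumPy; the function name, the 6 kHz threshold, and the synthetic noise signals are assumptions for illustration. Note that audio loaded with `sr=16000` contains nothing above 8 kHz (the Nyquist limit), so checking for a higher ceiling would require loading the file at its native sample rate.

```python
import numpy as np

def band_energy_fraction(y, sr, cutoff_hz):
    """Fraction of total spectral power at or above cutoff_hz."""
    y = np.asarray(y, dtype=float)
    power = np.abs(np.fft.rfft(y)) ** 2
    freqs = np.fft.rfftfreq(len(y), d=1.0 / sr)
    total = power.sum()
    if total == 0:
        return 0.0
    return float(power[freqs >= cutoff_hz].sum() / total)

# Synthetic check (assumption): white noise vs. the same noise cut off above 4 kHz.
sr = 16000
rng = np.random.default_rng(7)
full_band = rng.normal(size=sr)

spectrum = np.fft.rfft(full_band)
freqs = np.fft.rfftfreq(len(full_band), d=1.0 / sr)
spectrum[freqs > 4000] = 0.0  # simulate a hard frequency ceiling
band_limited = np.fft.irfft(spectrum, n=len(full_band))

frac_full = band_energy_fraction(full_band, sr, cutoff_hz=6000)
frac_limited = band_energy_fraction(band_limited, sr, cutoff_hz=6000)
print(f"full band: {frac_full:.3f}, band limited: {frac_limited:.3g}")
```

A near-zero fraction is not proof of synthesis on its own: telephone codecs and narrowband recording paths also remove high-frequency content, so the result should be read alongside the other indicators.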
Compile findings into an actionable report:
DEEPFAKE AUDIO ANALYSIS REPORT
================================
File: suspect_executive_call.wav
Duration: 47.3 seconds
Sample Rate: 16000 Hz
Analysis Date: 2026-03-19
CLASSIFICATION RESULT
Verdict: LIKELY DEEPFAKE (confidence: 94.2%)
Ensemble Score: RF=0.91, GBT=0.97, Avg=0.94
FEATURE ANOMALIES DETECTED
- MFCC variance in coefficients 13-20: 62% below genuine baseline
- Spectral contrast (4-8 kHz): 0.23 (genuine avg: 0.41)
- Pitch jitter: 0.8 Hz (genuine avg: 2.4 Hz)
- Zero-crossing rate std: 0.003 (genuine avg: 0.011)
SPECTROGRAM ARTIFACTS
- Energy cutoff above 7.8 kHz (consistent with neural vocoder ceiling)
- Banding pattern at 50ms intervals in mel spectrogram
- Missing formant transitions at 12.4s, 23.1s, 35.7s timestamps
RECOMMENDATION
High confidence of AI-generated audio. Recommend out-of-band
verification with the purported speaker. Preserve original audio
file with chain of custody documentation for potential legal action.
| Term | Definition |
|------|------------|
| MFCC | Mel-Frequency Cepstral Coefficients; representation of the short-term power spectrum on a mel (perceptual) frequency scale |
| Spectral Centroid | Weighted mean of frequencies present in the signal; indicates perceived brightness of a sound |
| Spectral Contrast | Difference in amplitude between peaks and valleys in the spectrum across frequency sub-bands |
| Vocoder | Signal processing component that synthesizes audio waveforms from acoustic features; used in TTS and voice cloning |
| Pitch Jitter | Cycle-to-cycle variation in fundamental frequency; natural in human speech, reduced in synthetic speech |
| Vishing | Voice phishing; social engineering attack conducted via phone calls, increasingly using AI-cloned voices |
| Formant | Resonant frequencies of the vocal tract that define vowel sounds; transitions between formants are difficult for AI to replicate perfectly |
Context: A CFO receives a phone call that appears to come from the CEO, requesting an urgent wire transfer of $2.3M. The call came from an unknown number, but the voice sounded identical to the CEO's. IT security obtained a recording of the call from the phone system.
Approach:
Pitfalls: