skills/bass-benchmarking-audio-lms/SKILL.md
Build evaluation benchmarks for audio language models using the BASS methodology — structured task taxonomies across structural segmentation, lyric transcription, musicological analysis, and artist collaboration. Trigger phrases: 'benchmark audio model', 'evaluate music understanding', 'music LLM evaluation', 'audio reasoning benchmark', 'test music AI capabilities', 'build music benchmark dataset'
npx skillsauth add ndpvt-web/arxiv-claude-skills bass-benchmarking-audio-lms
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
This skill enables Claude to design, implement, and run structured evaluation benchmarks for audio language models following the BASS framework. BASS decomposes music understanding into four testable categories — structural segmentation, lyric transcription, musicological analysis, and artist collaboration — spanning 12 concrete tasks with task-specific metrics. Use this skill to build benchmark pipelines, write evaluation harnesses, craft prompt templates with paraphrase diversity, and interpret model performance gaps across musical reasoning dimensions.
BASS structures music understanding evaluation around a four-category, twelve-task taxonomy. Rather than treating "music understanding" as a monolithic capability, it factors it into structural segmentation (can the model identify verse/chorus/bridge boundaries and labels?), lyric transcription (can it extract lyrics section-by-section?), musicological analysis (can it identify genres, instruments, and vocal attributes?), and artist collaboration (can it count, localize, and attribute individual artists in multi-performer tracks?). Each category uses distinct question formats and metrics suited to its nature — IoU for temporal segmentation, inverted Word Error Rate for transcription, normalized exact match for classification, and tolerance-windowed exact match for timestamp tasks.
The critical design choice is metric normalization to enable cross-task comparison. Raw accuracy on a 4-choice MCQ (25% random baseline) is not comparable to IoU on segmentation (near-0% random baseline). BASS normalizes classification accuracy as (EMA - R) / (1 - R) where R is random chance, making a 50% score on a 4-choice task equivalent to 33% normalized — removing the inflated floor. This lets you rank models across all 12 tasks on a common scale. The second key insight is prompt paraphrasing: each task uses 10 natural-language prompt variants (paraphrases of the same instruction) to measure robustness to phrasing, not just capability on a single template.
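As a worked example of the normalization formula above, here is a minimal Python sketch. The name `normalize_accuracy` is illustrative and not taken from the BASS codebase, and this version lets below-chance scores go negative instead of clamping them at 0.

```python
def normalize_accuracy(accuracy: float, num_choices: int) -> float:
    """Rescale accuracy so random guessing maps to 0 and a perfect score to 1."""
    if num_choices < 2:
        raise ValueError("need at least two answer options")
    random_chance = 1.0 / num_choices  # R in (EMA - R) / (1 - R)
    return (accuracy - random_chance) / (1.0 - random_chance)


# The same 50% raw accuracy means different things at different option counts.
four_choice = normalize_accuracy(0.50, 4)   # 0.25 / 0.75
eight_choice = normalize_accuracy(0.50, 8)  # 0.375 / 0.875
print(f"{four_choice:.3f} {eight_choice:.3f}")
```

With more answer options the chance floor shrinks, so the same raw accuracy normalizes to a higher score.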
The benchmark reveals that current frontier models (Gemini 2.5 Pro scoring 26.5% normalized overall) handle text-adjacent tasks like lyric transcription far better than structural reasoning. This gap — strong linguistic priors but weak musical structure modeling — is the actionable diagnostic: it tells builders where to invest training effort and tells evaluators which tasks actually differentiate models.
Define your task taxonomy. Map your evaluation domain into 3-5 orthogonal categories (e.g., structural, semantic, attribution). For music, use BASS's four: structural segmentation, lyric transcription, musicological analysis, artist collaboration. For other audio domains, adapt the pattern — e.g., for podcast understanding: topic segmentation, speaker diarization, content summarization, fact verification.
Design 2-4 concrete tasks per category. Each task must have a single unambiguous output format. Structural segmentation splits into Full Structural Segmentation (FSS: segment the entire song) and Section Structural Segmentation (SSS: identify instances of one section type). Musicological analysis splits into single-genre detection, pairwise-genre detection, genre attribution, and genre dominance ranking. Keep tasks graduated in difficulty within each category.
Select task-appropriate metrics. Match the metric to the output type:
- Temporal segmentation: IoU between predicted and ground-truth segments
- Transcription: inverted WER, 1 / (1 + WER), so higher is better
- Classification (MCQ): normalized exact match accuracy, (EMA - R) / (1 - R)
- Timestamp tasks: exact match within a tolerance window
Curate ground-truth data from authoritative sources. Use existing expert-annotated datasets rather than generating labels with LLMs. BASS draws structural annotations from the Harmonix Set (professional music analysts), lyrics from Genius (crowd-verified), musicological attributes from MGPHot (professional musicologist 0-5 ratings), and artist collaboration data from CoSoD. Always verify annotation quality on a sample before scaling.
Construct questions with distractor design for MCQ tasks. Use 4-8 answer options (more options reduce random-chance inflation). Draw distractors from the same domain — for genre classification, use plausible alternative genres from the dataset, not absurd options. For open-ended tasks, define the expected output schema precisely (e.g., "return a JSON array of {section_label, start_time, end_time}").
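Write 10 prompt paraphrases per task. Do not evaluate on a single prompt template; write 10 natural-language rephrasings of each task instruction. Example for structural segmentation:
- "Divide this track into its major sections. For each, provide the section label, start timestamp, and end timestamp."
- "Segment the entire song into structural parts (intro, verse, chorus, bridge, outro, etc.) with time boundaries."
- "Identify all structural sections in this song and their timestamps."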
Build the evaluation harness. Write code that: (a) loads audio files or references, (b) randomly selects one of the 10 prompt paraphrases per question, (c) calls the model API with audio + text prompt, (d) parses the model's response into the expected output schema, (e) computes the task-specific metric, and (f) stores raw and normalized scores.
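Normalize scores for cross-task comparison. Apply (score - random_baseline) / (1 - random_baseline) to all classification metrics. For IoU and inverted WER, the random baseline is treated as 0, so normalization is a no-op. Keep in mind that 1 / (1 + WER) equals 0.5 when WER = 1 (for example, an empty transcript), so inspect raw WER alongside the inverted score. Report both raw and normalized scores.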
Run evaluation across models and aggregate results. Test each model on all tasks, computing per-task scores, per-category averages, and an overall normalized average. Present results in a matrix (models x tasks) to reveal category-level strengths and weaknesses.
Analyze failure modes by category. The diagnostic value of BASS-style benchmarks is in the gaps. A model strong on lyric transcription but weak on structural segmentation is leveraging linguistic priors without modeling temporal structure. A model weak on artist collaboration but strong on genre detection lacks voice discrimination. Use these diagnostics to guide model improvement or selection.
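The aggregation step described above can be sketched as follows. This is an illustration under stated assumptions, not the BASS implementation: the task names, the `TASK_CATEGORY` mapping, the model names, and all scores are made up, and the overall score is assumed to be a plain mean over tasks.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical task -> category mapping, for illustration only.
TASK_CATEGORY = {
    "fss": "structural_segmentation",
    "sss": "structural_segmentation",
    "lyrics": "lyric_transcription",
    "single_genre": "musicological_analysis",
}


def aggregate(task_scores: dict[str, float]) -> dict:
    """Roll normalized per-task scores up to category and overall averages."""
    by_category = defaultdict(list)
    for task, score in task_scores.items():
        by_category[TASK_CATEGORY[task]].append(score)
    return {
        "per_category": {cat: mean(vals) for cat, vals in by_category.items()},
        "overall": mean(list(task_scores.values())),
    }


def format_matrix(results: dict[str, dict[str, float]]) -> str:
    """Render a models x tasks score table as aligned plain text."""
    tasks = sorted({t for scores in results.values() for t in scores})
    header = "model".ljust(10) + "".join(t.rjust(14) for t in tasks)
    rows = [
        name.ljust(10)
        + "".join(f"{scores.get(t, float('nan')):14.3f}" for t in tasks)
        for name, scores in results.items()
    ]
    return "\n".join([header, *rows])


# Made-up normalized scores for two hypothetical models.
results = {
    "model_a": {"fss": 0.10, "sss": 0.20, "lyrics": 0.70, "single_genre": 0.40},
    "model_b": {"fss": 0.30, "sss": 0.30, "lyrics": 0.50, "single_genre": 0.20},
}
summary_a = aggregate(results["model_a"])
print(format_matrix(results))
```

Reading the matrix by column shows which tasks separate the models; reading `per_category` shows where a single model is weakest.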
Example 1: Building a music understanding benchmark evaluation harness
User: "I want to evaluate Gemini and Qwen on music understanding tasks. Help me build a benchmark."
Approach:
Output:
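# bass_evaluator.py
# Requires Python 3.10+ (for the `X | None` annotation syntax).
import json
import random
import re
from dataclasses import dataclass
from typing import Literal


@dataclass
class BenchmarkQuestion:
    task: str
    category: str
    audio_path: str
    prompt_variants: list[str]
    ground_truth: dict
    question_type: Literal["open", "mcq"]
    mcq_options: list[str] | None = None


# Metric used for each category; keys must match BenchmarkQuestion.category.
TASK_METRICS = {
    "structural_segmentation": "iou",
    "lyric_transcription": "inverted_wer",
    "musicological_analysis": "normalized_ema",
    "artist_collaboration": "tolerance_match",
}

PROMPT_VARIANTS = {
    "fss": [
        "Divide this track into its major sections. For each, provide the section label, start timestamp, and end timestamp.",
        "Segment the entire song into structural parts (intro, verse, chorus, bridge, outro, etc.) with time boundaries.",
        "Identify all structural sections in this song and their timestamps.",
        # ... 7 more paraphrases
    ],
    "sgd": [
        "Listen to this song. Which of the following genres is most dominant? Options: {options}",
        "What is the primary genre of this track? Choose from: {options}",
        "Of the genres listed below, which best describes this song? {options}",
        # ... 7 more paraphrases
    ],
}


def parse_segments(response: str) -> list[dict]:
    """Extract a JSON array of {label, start, end} objects from a response.

    Assumes the prompt requested that schema; returns [] if nothing parses.
    """
    match = re.search(r"\[.*\]", response, re.DOTALL)
    if not match:
        return []
    try:
        segments = json.loads(match.group(0))
    except json.JSONDecodeError:
        return []
    required = {"label", "start", "end"}
    return [s for s in segments if isinstance(s, dict) and required <= s.keys()]


def parse_timestamp(response: str) -> float | None:
    """Return the first "m:ss" or plain-seconds value in a response, in seconds."""
    match = re.search(r"(\d+):(\d{1,2}(?:\.\d+)?)", response)
    if match:
        return int(match.group(1)) * 60 + float(match.group(2))
    match = re.search(r"\d+(?:\.\d+)?", response)
    return float(match.group(0)) if match else None


def compute_iou(predicted_segments: list[dict], gt_segments: list[dict]) -> float:
    """Mean best-match IoU over ground-truth segments, matching on label."""
    if not gt_segments:
        return 0.0
    total_iou = 0.0
    for gt in gt_segments:
        best_iou = 0.0
        for pred in predicted_segments:
            if pred["label"].lower() != gt["label"].lower():
                continue
            inter_start = max(pred["start"], gt["start"])
            inter_end = min(pred["end"], gt["end"])
            intersection = max(0.0, inter_end - inter_start)
            union = (pred["end"] - pred["start"]) + (gt["end"] - gt["start"]) - intersection
            if union > 0:
                best_iou = max(best_iou, intersection / union)
        # Ground-truth segments with no matching prediction contribute 0.
        total_iou += best_iou
    return total_iou / len(gt_segments)


def compute_inverted_wer(predicted: str, reference: str) -> float:
    """Inverted WER so higher is better. Uses 1 / (1 + WER)."""
    # Third-party dependency: pip install jiwer
    from jiwer import wer

    error_rate = wer(reference, predicted)
    return 1.0 / (1.0 + error_rate)


def normalize_ema(accuracy: float, num_choices: int) -> float:
    """Normalize exact match accuracy by removing the random-chance floor."""
    random_chance = 1.0 / num_choices
    if accuracy <= random_chance:
        return 0.0
    return (accuracy - random_chance) / (1.0 - random_chance)


def evaluate_question(question: BenchmarkQuestion, model_response: str) -> float:
    """Return the raw score for one question.

    MCQ correctness is returned as 0.0 or 1.0; normalization is applied to
    the aggregated accuracy in run_benchmark, not per question.
    """
    metric = TASK_METRICS[question.category]
    if metric == "iou":
        predicted = parse_segments(model_response)
        return compute_iou(predicted, question.ground_truth["segments"])
    if metric == "inverted_wer":
        return compute_inverted_wer(model_response, question.ground_truth["lyrics"])
    if metric == "normalized_ema":
        correct = model_response.strip().lower() == question.ground_truth["answer"].lower()
        return float(correct)
    if metric == "tolerance_match":
        pred_time = parse_timestamp(model_response)
        gt_time = question.ground_truth["timestamp"]
        # A response with no parseable timestamp counts as a miss.
        return 1.0 if pred_time is not None and abs(pred_time - gt_time) <= 3.0 else 0.0
    raise ValueError(f"Unknown metric: {metric}")


def run_benchmark(questions: list[BenchmarkQuestion], model_fn) -> dict:
    """Run the full benchmark and return the mean score per task.

    MCQ tasks are normalized on the aggregated accuracy, assuming every
    question in a task has the same number of options.
    """
    results: dict[str, list[float]] = {}
    num_choices: dict[str, int] = {}
    for q in questions:
        prompt = random.choice(q.prompt_variants)
        if q.mcq_options:
            # Fill the {options} placeholder used by the MCQ templates.
            prompt = prompt.replace("{options}", ", ".join(q.mcq_options))
        response = model_fn(q.audio_path, prompt)
        results.setdefault(q.task, []).append(evaluate_question(q, response))
        if TASK_METRICS[q.category] == "normalized_ema":
            num_choices[q.task] = len(q.mcq_options)
    task_scores = {}
    for task, scores in results.items():
        score = sum(scores) / len(scores)
        if task in num_choices:
            score = normalize_ema(score, num_choices[task])
        task_scores[task] = score
    return task_scores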
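The harness above samples one paraphrase per question. To measure robustness to phrasing directly, one option is to score every item under every variant and compare the per-variant means. The sketch below is self-contained and hypothetical: `stub_model`, `exact_match`, the prompts, and the items are stand-ins, and the stub is deliberately sensitive to wording so the spread is visible.

```python
from statistics import mean, pstdev


def score_by_variant(variants, model_fn, score_fn, items) -> list[float]:
    """Score every item under every prompt variant; return one mean per variant."""
    return [
        mean([score_fn(model_fn(item, prompt), item) for item in items])
        for prompt in variants
    ]


def stub_model(item: dict, prompt: str) -> str:
    """Stand-in for a real model call: only answers when it sees the word 'genre'."""
    return item["answer"] if "genre" in prompt.lower() else "unknown"


def exact_match(response: str, item: dict) -> float:
    """1.0 if the response equals the reference answer, ignoring case and padding."""
    return float(response.strip().lower() == item["answer"].lower())


variants = [
    "What is the primary genre of this track?",
    "Which genre best describes this song?",
    "Name the dominant musical style here.",
]
items = [
    {"audio_path": "a.wav", "answer": "rock"},
    {"audio_path": "b.wav", "answer": "jazz"},
]

per_variant = score_by_variant(variants, stub_model, exact_match, items)
spread = max(per_variant) - min(per_variant)  # gap between best and worst phrasing
print(per_variant, round(pstdev(per_variant), 3))
```

A large spread suggests the model is reacting to wording instead of the audio, which a single-template evaluation would hide.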
Example 2: Designing an evaluation for a custom audio domain (podcast understanding)
User: "I want to adapt the BASS approach to evaluate models on podcast understanding."
Approach:
Output:
Adapted Taxonomy:
Category                | Tasks                             | Metric
----------------------- | --------------------------------- | ----------------------------
Structural Segmentation | Topic boundary detection          | IoU (timestamps)
                        | Segment labeling (intro/ad/topic) | IoU (timestamps + labels)
Content Transcription   | Full transcript                   | Inverted WER
                        | Section-specific transcript       | Inverted WER
Semantic Analysis       | Topic classification (MCQ)        | Normalized EMA
                        | Sentiment/tone detection (MCQ)    | Normalized EMA
                        | Key claim extraction (open)       | ROUGE-L
Speaker Attribution     | Speaker count                     | Exact match
                        | Speaker diarization               | DER (Diarization Error Rate)
                        | Speaker-to-utterance mapping      | Tolerance match (±2 s)
Prompt paraphrases for topic boundary detection (3 of 10):
1. "Identify where the main topics change in this podcast episode. Return timestamps."
2. "Segment this podcast into distinct topic sections with start and end times."
3. "At what points does the conversation shift to a new subject? List all transitions."
Normalization: MCQ tasks with 5 options use R=0.2. A model scoring 60% raw
gets (0.6 - 0.2) / (1 - 0.2) = 50% normalized.
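For the tolerance-match metric listed for speaker-to-utterance mapping, one possible scoring rule is sketched below. The source does not specify how multiple timestamps are paired, so the one-to-one greedy pairing, the function name `tolerance_match_rate`, and the example times are all assumptions.

```python
def tolerance_match_rate(
    predicted: list[float], reference: list[float], tolerance: float = 2.0
) -> float:
    """Fraction of reference times matched by a distinct prediction within tolerance seconds."""
    unused = sorted(predicted)
    hits = 0
    for ref in sorted(reference):
        # Candidates are predictions not yet paired with an earlier reference time.
        candidates = [p for p in unused if abs(p - ref) <= tolerance]
        if candidates:
            best = min(candidates, key=lambda p: abs(p - ref))
            unused.remove(best)  # each prediction can satisfy only one reference
            hits += 1
    return hits / len(reference) if reference else 0.0


# Hypothetical speaker-change times in seconds.
rate = tolerance_match_rate([12.5, 61.0, 140.0], [12.0, 58.0, 139.0, 200.0])
print(rate)
```

Greedy pairing is simple but not guaranteed optimal when reference times sit closer together than the tolerance; an assignment solver would be the stricter alternative.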
Example 3: Analyzing benchmark results to diagnose model weaknesses
User: "My audio model scores 80% on lyric transcription but 15% on structural segmentation. What does this mean and how do I fix it?"
Approach:
Output:
Diagnosis (per BASS findings):
- High lyric transcription + low structural segmentation is the most common
failure pattern across all 14 models BASS evaluated.
- This indicates your model has strong linguistic priors (it can decode speech
from audio) but weak temporal-structural reasoning (it cannot identify
when musical sections begin/end or what they are).
Root causes:
1. Training data likely emphasizes speech/text alignment over musical form
2. Model may lack temporal attention mechanisms needed for long-range structure
3. Audio encoder may compress temporal information too aggressively
Targeted improvements:
1. Add structural segmentation examples to fine-tuning data (Harmonix Set
provides 912 songs with section annotations)
2. Augment training with explicit timestamp prediction tasks — force the
model to output start/end times, not just labels
3. Consider longer context windows — structural segmentation requires
reasoning over the full song (avg 4.16 min in BASS), while lyric
transcription can work on short segments
4. Evaluate with BASS's IoU metric during training as a validation signal
to track structural reasoning improvement independently
Paper: BASS: Benchmarking Audio LMs for Musical Structure and Semantic Reasoning — Jang et al., 2026. Look for Table 1 (normalized model scores across all 12 tasks), Table 5 (prompt templates), and Section 4.2 (metric normalization formulas). Code and data: github.com/minjang10/bass_music_benchmark.