skills/skillxiv-v0.0.2-claude-opus-4.6/audio-flamingo-3-multimodal-reasoning/SKILL.md
Build fully open audio-language models supporting reasoning over speech, sound, and music with 10-minute long-form comprehension and multi-turn conversation capabilities. Use when you need to process audio modalities alongside text for complex reasoning tasks across speech recognition, sound classification, and music analysis.
npx skillsauth add ADu2021/skillXiv audio-flamingo-3-multimodal-reasoningInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Large language models have transformed text understanding, but audio remains underexplored in open-source systems. Existing audio models either focus narrowly on speech recognition or rely on proprietary architectures. AF3 addresses this gap by combining unified audio encoding with LLM-based reasoning, enabling models to handle speech, sound effects, and music simultaneously without separate task-specific encoders.
The key insight is that a single unified audio encoder trained on joint representation learning outperforms modality-specific designs. By pairing this with multi-stage curriculum training and novel reasoning-focused datasets, AF3 achieves state-of-the-art results across 20+ benchmarks using only open-source training data.
AF3 builds on the principle that audio understanding should mirror language understanding: compress information through a unified encoder, then leverage a pre-trained LLM backbone for reasoning. The AF-Whisper encoder processes all audio modalities through a single Transformer architecture with attached decoder layers, learning joint representations rather than splitting audio into separate processing streams.
The system chains capabilities vertically: encoder learns feature compression, adapter bridges to language space, and the LLM (Qwen-2.5-7B) performs reasoning. Training progresses through five curriculum stages, gradually expanding context lengths from 30 seconds to 10 minutes and introducing reasoning supervision.
This initial stage trains only the audio adaptor while freezing encoder and LLM, establishing the mapping between audio feature space and language embeddings.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer
# Initialize frozen encoder and LLM
encoder = AutoModel.from_pretrained("openai/whisper-large-v3")
llm = AutoModel.from_pretrained("Qwen/Qwen2.5-7B")
# Trainable adaptor: project encoder outputs to LLM embedding dim
class AudioAdaptor(nn.Module):
def __init__(self, encoder_dim=1024, llm_dim=4096):
super().__init__()
self.proj = nn.Linear(encoder_dim, llm_dim)
self.norm = nn.LayerNorm(llm_dim)
def forward(self, audio_features):
# audio_features: (batch, seq_len, encoder_dim)
adapted = self.proj(audio_features)
return self.norm(adapted)
adaptor = AudioAdaptor()
optimizer = torch.optim.AdamW(adaptor.parameters(), lr=1e-4)
The adaptor learns to compress audio semantic information into the language embedding space through reconstruction loss on the frozen LLM's internal representations.
Unfreeze the encoder while keeping LLM frozen, allowing the encoder to adapt its representations toward the language space while the adaptor continues refining.
# Enable gradients for encoder
for param in encoder.parameters():
param.requires_grad = True
# Create unified optimizer for encoder + adaptor
params = list(encoder.parameters()) + list(adaptor.parameters())
optimizer = torch.optim.AdamW(params, lr=1e-4, weight_decay=0.1)
# Training loop for one epoch
encoder.train()
adaptor.train()
llm.eval()
for batch in train_loader:
audio_input = batch['audio'] # (batch, seq_len, n_mels)
# Encode audio
audio_features = encoder(audio_input) # (batch, seq_len, 1024)
# Adapt to LLM space
adapted = adaptor(audio_features)
# Forward through frozen LLM
with torch.no_grad():
lm_output = llm(inputs_embeds=adapted)
# Loss based on language modeling objective
loss = language_modeling_loss(lm_output, target_tokens)
loss.backward()
optimizer.step()
Unfreeze the entire model and train on the expanded AudioSkills-XL dataset containing reasoning-focused QA pairs, now with longer audio contexts.
# Full model fine-tuning with gradient checkpointing
llm.train()
encoder.train()
adaptor.train()
# Enable gradient checkpointing to fit longer sequences
llm.gradient_checkpointing_enable()
# Use longer batch size with larger context
for batch in audioSkills_loader:
audio_input = batch['audio'] # up to 150 seconds (2.5 min)
question = batch['question']
answer = batch['answer']
# Process audio through full pipeline
audio_features = encoder(audio_input)
adapted = adaptor(audio_features)
# Construct prompt with both audio representation and question
lm_input = torch.cat([adapted, question_embed], dim=1)
output = llm.generate(
inputs_embeds=lm_input,
max_new_tokens=256,
temperature=0.7
)
# Compute loss on answer generation
loss = compute_lm_loss(output, answer)
loss.backward()
optimizer.step()
Apply Low-Rank Adaptation to efficiently handle 10-minute audio contexts while minimizing training cost, combined with AF-Think reasoning supervision.
from peft import get_peft_model, LoraConfig
# Apply LoRA to attention and feed-forward in encoder and LLM
lora_config = LoraConfig(
r=64,
lora_alpha=128,
target_modules=["q_proj", "v_proj", "dense"],
lora_dropout=0.05,
bias="none"
)
encoder_lora = get_peft_model(encoder, lora_config)
llm_lora = get_peft_model(llm, lora_config)
# Train with long-audio dataset (up to 10 minutes)
for batch in longAudio_loader:
# Audio up to 600 seconds; use streaming/chunked processing
audio_chunks = chunk_audio(batch['audio'], chunk_size=30) # 30s chunks
# Process chunks sequentially with KV cache reuse
audio_reps = []
for chunk in audio_chunks:
chunk_rep = encoder_lora(chunk)
adapted_chunk = adaptor(chunk_rep)
audio_reps.append(adapted_chunk)
# Concatenate all chunks
full_audio_rep = torch.cat(audio_reps, dim=1)
# Generate with reasoning prefix (AF-Think)
output = llm_lora.generate(
inputs_embeds=full_audio_rep,
max_new_tokens=512,
output_scores=True
)
Train on AF-Chat dataset containing multi-turn conversations with multiple audio clips per turn.
# Create conversation formatter
def format_conversation(conversation):
# conversation: [{"role": "user", "audios": [...], "text": "..."}, ...]
formatted = []
for turn in conversation:
role = turn['role']
audios = turn['audios']
text = turn['text']
# Encode all audios in turn
audio_reps = []
for audio in audios:
rep = encoder(audio)
adapted = adaptor(rep)
audio_reps.append(adapted)
formatted.append({
'role': role,
'audio_reps': audio_reps,
'text': text
})
return formatted
# Training with conversation format
for batch in af_chat_loader:
formatted = format_conversation(batch['conversation'])
# Build full conversation context
conv_input = []
for turn in formatted:
conv_input.extend(turn['audio_reps'])
conv_input.append(tokenizer.encode(turn['text']))
output = llm(inputs_embeds=torch.cat(conv_input, dim=1))
loss = cross_entropy_loss(output, target_tokens)
loss.backward()
optimizer.step()
| Parameter | Value | Notes | |-----------|-------|-------| | Learning Rate | 1e-4 | Used across all stages; decayed by 0.1 factor | | Batch Size | 1024 | For AF-Whisper pre-training; reduced for longer contexts | | Training Infrastructure | 128 × A100 (80GB) | Distributed training across 128 GPUs | | Audio Frame Rate | 50Hz | Post-pooling from mel-spectrogram processing | | Mel-spectrogram Channels | 128 | Provides sufficient frequency resolution | | Max Context Length | 10 minutes | Progressive expansion: 30s → 2.5m → 10m | | TTS Inference Speed | ~0.15s | Time-to-first-token in streaming mode |
Xie, Y., Tian, Y., Wang, Z., et al. (2024). Audio Flamingo 3: Advancing Audio Intelligence with Fully Open Large Audio Language Models. arXiv preprint arXiv:2507.08128.
testing
Uses flow maps as look-ahead operators to enable principled reward-guided diffusion by predicting trajectory endpoints at any denoising step. Deploy when applying rewards or preferences to diffusion trajectories with meaningful gradients throughout generation.
testing
Train language models where each expert learns independently on closed datasets, enabling flexible inference with selective data inclusion or exclusion. 41% performance improvement while allowing users to opt out of specific data sources without retraining.
data-ai
Understand how token generation flexibility in diffusion LMs paradoxically constrains reasoning, as models exploit ordering flexibility to avoid uncertain tokens, and apply simplified approaches that preserve parallel decoding benefits. Use when optimizing diffusion-based language models for reasoning tasks.
devops
Enable LLM agents to improve continuously during deployment by constructing structured experience libraries through self-reflection on successes and failures—achieving 23% improvement on reasoning without gradient-based parameter updates or external training.