skills/skillxiv-v0.0.2-claude-opus-4.6/adaptive-data-refinement-vlm-long-tail/SKILL.md
Mitigate long-tail distribution problems in VLM training data through adaptive rebalancing and diffusion-based synthesis. Uses entity distribution analysis to identify head/tail imbalance and applies targeted data augmentation, improving LLaVA 1.5 performance by 4.36% without increasing training data volume.
npx skillsauth add ADu2021/skillXiv adaptive-data-refinement-vlm-long-tailInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Large Vision-Language Models (LVLMs) suffer from long-tail distribution problems where common concepts are overrepresented while rare concepts are underrepresented in training data. The ADR (Adaptive Data Refinement) framework addresses this by: (1) analyzing entity distributions across token, object, co-occurrence, and query perspectives; (2) adaptively rebalancing redundant head-category data; and (3) synthesizing new samples for underrepresented tail categories using diffusion models.
ADR consists of two sequential stages:
Analyze training data to identify head vs. tail entity distributions across multiple perspectives:
# Entity distribution analysis for long-tail detection
from collections import defaultdict, Counter
import numpy as np
def analyze_entity_distributions(dataset, perspectives=['token', 'object']):
"""
Construct entity distributions across multiple perspectives.
Args:
dataset: list of (image, text) tuples from VLM training
perspectives: analysis dimensions ['token', 'object', 'cooccurrence']
Returns:
dict with distributions and head/tail split points
"""
distributions = {}
if 'token' in perspectives:
# Tokenize all captions and count vocabulary frequency
token_freq = Counter()
for _, caption in dataset:
tokens = caption.lower().split()
token_freq.update(tokens)
# Identify head/tail split (e.g., 80/20 principle)
freq_sorted = sorted(token_freq.values(), reverse=True)
cumsum = np.cumsum(freq_sorted)
total = cumsum[-1]
head_idx = np.where(cumsum >= 0.8 * total)[0][0]
head_tokens = set([t for t, f in token_freq.items()
if f >= freq_sorted[head_idx]])
distributions['token'] = {
'freq': token_freq,
'head_tokens': head_tokens,
'tail_tokens': set(token_freq.keys()) - head_tokens,
'head_count': sum(f for t, f in token_freq.items()
if t in head_tokens),
}
if 'object' in perspectives:
# Run object detection on images and aggregate
object_freq = Counter()
for image, _ in dataset:
objects = detect_objects(image) # Requires detector
object_freq.update(objects)
# Find head/tail objects
freq_sorted = sorted(object_freq.values(), reverse=True)
cumsum = np.cumsum(freq_sorted)
total = cumsum[-1]
head_idx = np.where(cumsum >= 0.8 * total)[0][0]
head_objects = set([o for o, f in object_freq.items()
if f >= freq_sorted[head_idx]])
distributions['object'] = {
'freq': object_freq,
'head_objects': head_objects,
'tail_objects': set(object_freq.keys()) - head_objects,
}
return distributions
Create a weighting scheme to downsample head categories and preserve tail categories:
def construct_probability_dictionary(dataset, distributions,
head_downsample_ratio=0.5):
"""
Build probability weights for each sample based on entity prevalence.
Args:
dataset: training data
distributions: entity frequency distributions
head_downsample_ratio: fraction of head samples to keep
Returns:
sample_weights: probability of keeping each sample
"""
token_dist = distributions.get('token', {})
head_tokens = token_dist.get('head_tokens', set())
sample_weights = []
for image, caption in dataset:
tokens = set(caption.lower().split())
# Count how many head tokens in this sample
head_token_count = len(tokens & head_tokens)
total_token_count = len(tokens)
# Samples with mostly head content get downsampled
if total_token_count > 0:
head_ratio = head_token_count / total_token_count
else:
head_ratio = 0.0
# Downweight head-heavy samples, preserve tail
if head_ratio > 0.8: # Mostly head content
weight = head_downsample_ratio
elif head_ratio > 0.5: # Mixed
weight = 0.75
else: # Mostly tail content
weight = 1.0
sample_weights.append(weight)
# Normalize to probabilities
sample_weights = np.array(sample_weights)
sample_weights = sample_weights / sample_weights.sum()
return sample_weights
Use diffusion models to generate images for underrepresented tail entities:
def synthesize_tail_data(tail_entities, num_samples_per_entity,
text_generator, diffusion_model):
"""
Generate synthetic images for tail entity categories.
Args:
tail_entities: list of underrepresented object/concept names
num_samples_per_entity: how many images to generate per entity
text_generator: LLM for generating diverse captions
diffusion_model: DDPM for image generation
Returns:
list of (synthetic_image, caption) tuples
"""
synthetic_data = []
for entity in tail_entities:
# Generate diverse text descriptions for this entity
prompts = []
for i in range(num_samples_per_entity):
# Use LLM to generate varied descriptions
prompt = text_generator(
f"Generate a diverse visual description of a {entity} "
f"variation {i+1}. Include scene context, colors, pose.",
temperature=0.8
)
prompts.append(prompt)
# Use diffusion model to generate images from text
for prompt in prompts:
# Add entity keyword to ensure relevance
full_prompt = f"{entity}, {prompt}"
# Generate with diffusion model
image = diffusion_model.generate(
prompt=full_prompt,
num_steps=50,
guidance_scale=7.5,
seed=None # Randomize for diversity
)
synthetic_data.append((image, full_prompt))
return synthetic_data
testing
Uses flow maps as look-ahead operators to enable principled reward-guided diffusion by predicting trajectory endpoints at any denoising step. Deploy when applying rewards or preferences to diffusion trajectories with meaningful gradients throughout generation.
testing
Train language models where each expert learns independently on closed datasets, enabling flexible inference with selective data inclusion or exclusion. 41% performance improvement while allowing users to opt out of specific data sources without retraining.
data-ai
Understand how token generation flexibility in diffusion LMs paradoxically constrains reasoning, as models exploit ordering flexibility to avoid uncertain tokens, and apply simplified approaches that preserve parallel decoding benefits. Use when optimizing diffusion-based language models for reasoning tasks.
devops
Enable LLM agents to improve continuously during deployment by constructing structured experience libraries through self-reflection on successes and failures—achieving 23% improvement on reasoning without gradient-based parameter updates or external training.