Adoption

Agent Skills are supported by leading AI development tools.

VS Code Gemini CLI GitHub Goose Amp Cursor Claude Code Letta OpenCode Claude OpenAI Codex Factory VS Code Gemini CLI GitHub Goose Amp Cursor Claude Code Letta OpenCode Claude OpenAI Codex Factory

ADu2021/adaptive-data-refinement-vlm-long-tail

Name: adaptive-data-refinement-vlm-long-tail
Author: ADu2021

skills/skillxiv-v0.0.2-claude-opus-4.6/adaptive-data-refinement-vlm-long-tail/SKILL.md

npx skillsauth add ADu2021/skillXiv adaptive-data-refinement-vlm-long-tail

Clean

TrivyContainer and dependency vulnerability scanner

Clean

SemgrepStatic code analysis for vulnerabilities

Clean

mcp-scan (Snyk)Model Context Protocol security validation

Skipped

Snyk (dep)Open source security scanning

Skipped

Socket.devSupply chain security analysis

Skipped

VirusTotalMulti-engine malware detection

Skipped

CrowdStrikeAdvanced threat intelligence

Skipped

OSV-ScannerOpen Source Vulnerability database check

Skipped

OWASP Dep-Check

Core Concept

Large Vision-Language Models (LVLMs) suffer from long-tail distribution problems where common concepts are overrepresented while rare concepts are underrepresented in training data. The ADR (Adaptive Data Refinement) framework addresses this by: (1) analyzing entity distributions across token, object, co-occurrence, and query perspectives; (2) adaptively rebalancing redundant head-category data; and (3) synthesizing new samples for underrepresented tail categories using diffusion models.

Architecture Overview

ADR consists of two sequential stages:

Data Rebalancing (DR) Stage: Analyzes entity distributions and removes redundant samples from over-represented classes while retaining diverse examples
Data Synthesis (DS) Stage: Uses Denoising Diffusion Probabilistic Models (DDPMs) to generate new images for tail entities paired with LLM-synthesized text descriptions
Entity Distribution Analysis: Four perspectives identify imbalance: token-level vocabulary distribution, object detection distributions, co-occurrence patterns, and query-based question distributions

Implementation Steps

1. Entity Distribution Construction and Analysis

Analyze training data to identify head vs. tail entity distributions across multiple perspectives:

# Entity distribution analysis for long-tail detection
from collections import defaultdict, Counter
import numpy as np

def analyze_entity_distributions(dataset, perspectives=['token', 'object']):
    """
    Construct entity distributions across multiple perspectives.

    Args:
        dataset: list of (image, text) tuples from VLM training
        perspectives: analysis dimensions ['token', 'object', 'cooccurrence']

    Returns:
        dict with distributions and head/tail split points
    """
    distributions = {}

    if 'token' in perspectives:
        # Tokenize all captions and count vocabulary frequency
        token_freq = Counter()
        for _, caption in dataset:
            tokens = caption.lower().split()
            token_freq.update(tokens)

        # Identify head/tail split (e.g., 80/20 principle)
        freq_sorted = sorted(token_freq.values(), reverse=True)
        cumsum = np.cumsum(freq_sorted)
        total = cumsum[-1]
        head_idx = np.where(cumsum >= 0.8 * total)[0][0]
        head_tokens = set([t for t, f in token_freq.items()
                          if f >= freq_sorted[head_idx]])

        distributions['token'] = {
            'freq': token_freq,
            'head_tokens': head_tokens,
            'tail_tokens': set(token_freq.keys()) - head_tokens,
            'head_count': sum(f for t, f in token_freq.items()
                             if t in head_tokens),
        }

    if 'object' in perspectives:
        # Run object detection on images and aggregate
        object_freq = Counter()
        for image, _ in dataset:
            objects = detect_objects(image)  # Requires detector
            object_freq.update(objects)

        # Find head/tail objects
        freq_sorted = sorted(object_freq.values(), reverse=True)
        cumsum = np.cumsum(freq_sorted)
        total = cumsum[-1]
        head_idx = np.where(cumsum >= 0.8 * total)[0][0]
        head_objects = set([o for o, f in object_freq.items()
                           if f >= freq_sorted[head_idx]])

        distributions['object'] = {
            'freq': object_freq,
            'head_objects': head_objects,
            'tail_objects': set(object_freq.keys()) - head_objects,
        }

    return distributions

2. Data Rebalancing Through Probability Dictionary

Create a weighting scheme to downsample head categories and preserve tail categories:

def construct_probability_dictionary(dataset, distributions,
                                    head_downsample_ratio=0.5):
    """
    Build probability weights for each sample based on entity prevalence.

    Args:
        dataset: training data
        distributions: entity frequency distributions
        head_downsample_ratio: fraction of head samples to keep

    Returns:
        sample_weights: probability of keeping each sample
    """
    token_dist = distributions.get('token', {})
    head_tokens = token_dist.get('head_tokens', set())

    sample_weights = []

    for image, caption in dataset:
        tokens = set(caption.lower().split())

        # Count how many head tokens in this sample
        head_token_count = len(tokens & head_tokens)
        total_token_count = len(tokens)

        # Samples with mostly head content get downsampled
        if total_token_count > 0:
            head_ratio = head_token_count / total_token_count
        else:
            head_ratio = 0.0

        # Downweight head-heavy samples, preserve tail
        if head_ratio > 0.8:  # Mostly head content
            weight = head_downsample_ratio
        elif head_ratio > 0.5:  # Mixed
            weight = 0.75
        else:  # Mostly tail content
            weight = 1.0

        sample_weights.append(weight)

    # Normalize to probabilities
    sample_weights = np.array(sample_weights)
    sample_weights = sample_weights / sample_weights.sum()

    return sample_weights

3. Diffusion-Based Visual Data Synthesis for Tail Categories

Use diffusion models to generate images for underrepresented tail entities:

def synthesize_tail_data(tail_entities, num_samples_per_entity,
                        text_generator, diffusion_model):
    """
    Generate synthetic images for tail entity categories.

    Args:
        tail_entities: list of underrepresented object/concept names
        num_samples_per_entity: how many images to generate per entity
        text_generator: LLM for generating diverse captions
        diffusion_model: DDPM for image generation

    Returns:
        list of (synthetic_image, caption) tuples
    """
    synthetic_data = []

    for entity in tail_entities:
        # Generate diverse text descriptions for this entity
        prompts = []
        for i in range(num_samples_per_entity):
            # Use LLM to generate varied descriptions
            prompt = text_generator(
                f"Generate a diverse visual description of a {entity} "
                f"variation {i+1}. Include scene context, colors, pose.",
                temperature=0.8
            )
            prompts.append(prompt)

        # Use diffusion model to generate images from text
        for prompt in prompts:
            # Add entity keyword to ensure relevance
            full_prompt = f"{entity}, {prompt}"

            # Generate with diffusion model
            image = diffusion_model.generate(
                prompt=full_prompt,
                num_steps=50,
                guidance_scale=7.5,
                seed=None  # Randomize for diversity
            )

            synthetic_data.append((image, full_prompt))

    return synthetic_data

Practical Guidance

When to Use ADR

Training vision-language models on diverse datasets with natural long-tail distributions
Need to improve tail category performance without adding more real data
Models like LLaVA where object/concept distribution is imbalanced
Want to improve generalization on rare visual concepts

When NOT to Use

Datasets already well-balanced across categories
Where synthetic data quality is critical (medical imaging, etc.)
When tail categories should remain underrepresented by design
Very small datasets where rebalancing statistics are unreliable

Hyperparameters & Configuration

Head threshold percentile: 80% (80/20 Pareto principle); adjust based on data skewness
Head downsample ratio: 0.5 (keep 50% of head samples); balance between coverage and tail emphasis
Synthesis samples per entity: 5-10 per tail entity; more diversity helps generalization
Entity detection model: YOLOv8 or similar for object detection
Diffusion steps: 50 steps for quality; 25 for speed
Guidance scale: 7.5 for text-image alignment; higher values (10+) strengthen entity presence

Common Pitfalls

Over-rebalancing: Completely removing head categories hurts overall accuracy; keep some head data
Poor caption quality for synthesis: LLM-generated captions must be descriptive; use temperature 0.7-0.8
Synthetic data mode collapse: Diffusion models may generate similar images; use seed randomization
Entity detection errors: Object detector mistakes propagate; validate detection on sample images
Ignoring statistical significance: Small tail categories may need higher synthesis count for reliability
Validation set bias: Ensure test set reflects original distribution, not rebalanced distribution

Reference

Kingma & Welling. 2014. Auto-Encoding Variational Bayes (VAE foundations).
Ho et al. 2020. Denoising Diffusion Probabilistic Models (DDPM).
Liu et al. 2019. Decoupling Representation and Classifier for Long-Tailed Recognition.
LLaVA project: https://github.com/haotian-liu/LLaVA
Project page: https://vlmlt.github.io/

ADu2021/adaptive-data-refinement-vlm-long-tail

skills/skillxiv-v0.0.2-claude-opus-4.6/adaptive-data-refinement-vlm-long-tail/SKILL.md

Mitigate long-tail distribution problems in VLM training data through adaptive rebalancing and diffusion-based synthesis. Uses entity distribution analysis to identify head/tail imbalance and applies targeted data augmentation, improving LLaVA 1.5 performance by 4.36% without increasing training data volume.

2 stars

testing

Updated Apr 16, 2026

$ install --global

skillsauth

npx skillsauth add ADu2021/skillXiv adaptive-data-refinement-vlm-long-tail

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

Scanners Passed

Scanners in report

Clean

TrivyContainer and dependency vulnerability scanner

95%

Clean

SemgrepStatic code analysis for vulnerabilities

95%

Clean

mcp-scan (Snyk)Model Context Protocol security validation

95%

Skipped

Snyk (dep)Open source security scanning

50%

Skipped

Socket.devSupply chain security analysis

50%

Skipped

VirusTotalMulti-engine malware detection

50%

Skipped

CrowdStrikeAdvanced threat intelligence

50%

Skipped

OSV-ScannerOpen Source Vulnerability database check

50%

Skipped

OWASP Dep-Check

50%

Last scanned: Apr 16, 2026, 3:02 PM4.3s1 file scanned

SKILL.md

name:: adaptive-data-refinement-vlm-long-tail
title:: From Head to Tail: Towards Balanced Representation in Large Vision-Language Models through Adaptive Data Calibration
version:: 0.0.2
engine:: skillxiv-v0.0.2-claude-opus-4.6
license:: MIT
url:: https://arxiv.org/abs/2503.12821
keywords:: [Vision-Language Models, Long-Tail Learning, Data Rebalancing, Diffusion Models, LVLM Training]
description:: Mitigate long-tail distribution problems in VLM training data through adaptive rebalancing and diffusion-based synthesis. Uses entity distribution analysis to identify head/tail imbalance and applies targeted data augmentation, improving LLaVA 1.5 performance by 4.36% without increasing training data volume.

Core Concept

Architecture Overview

ADR consists of two sequential stages:

Data Rebalancing (DR) Stage: Analyzes entity distributions and removes redundant samples from over-represented classes while retaining diverse examples
Data Synthesis (DS) Stage: Uses Denoising Diffusion Probabilistic Models (DDPMs) to generate new images for tail entities paired with LLM-synthesized text descriptions
Entity Distribution Analysis: Four perspectives identify imbalance: token-level vocabulary distribution, object detection distributions, co-occurrence patterns, and query-based question distributions

Implementation Steps

1. Entity Distribution Construction and Analysis

Analyze training data to identify head vs. tail entity distributions across multiple perspectives:

# Entity distribution analysis for long-tail detection
from collections import defaultdict, Counter
import numpy as np

def analyze_entity_distributions(dataset, perspectives=['token', 'object']):
    """
    Construct entity distributions across multiple perspectives.

    Args:
        dataset: list of (image, text) tuples from VLM training
        perspectives: analysis dimensions ['token', 'object', 'cooccurrence']

    Returns:
        dict with distributions and head/tail split points
    """
    distributions = {}

    if 'token' in perspectives:
        # Tokenize all captions and count vocabulary frequency
        token_freq = Counter()
        for _, caption in dataset:
            tokens = caption.lower().split()
            token_freq.update(tokens)

        # Identify head/tail split (e.g., 80/20 principle)
        freq_sorted = sorted(token_freq.values(), reverse=True)
        cumsum = np.cumsum(freq_sorted)
        total = cumsum[-1]
        head_idx = np.where(cumsum >= 0.8 * total)[0][0]
        head_tokens = set([t for t, f in token_freq.items()
                          if f >= freq_sorted[head_idx]])

        distributions['token'] = {
            'freq': token_freq,
            'head_tokens': head_tokens,
            'tail_tokens': set(token_freq.keys()) - head_tokens,
            'head_count': sum(f for t, f in token_freq.items()
                             if t in head_tokens),
        }

    if 'object' in perspectives:
        # Run object detection on images and aggregate
        object_freq = Counter()
        for image, _ in dataset:
            objects = detect_objects(image)  # Requires detector
            object_freq.update(objects)

        # Find head/tail objects
        freq_sorted = sorted(object_freq.values(), reverse=True)
        cumsum = np.cumsum(freq_sorted)
        total = cumsum[-1]
        head_idx = np.where(cumsum >= 0.8 * total)[0][0]
        head_objects = set([o for o, f in object_freq.items()
                           if f >= freq_sorted[head_idx]])

        distributions['object'] = {
            'freq': object_freq,
            'head_objects': head_objects,
            'tail_objects': set(object_freq.keys()) - head_objects,
        }

    return distributions

2. Data Rebalancing Through Probability Dictionary

Create a weighting scheme to downsample head categories and preserve tail categories:

def construct_probability_dictionary(dataset, distributions,
                                    head_downsample_ratio=0.5):
    """
    Build probability weights for each sample based on entity prevalence.

    Args:
        dataset: training data
        distributions: entity frequency distributions
        head_downsample_ratio: fraction of head samples to keep

    Returns:
        sample_weights: probability of keeping each sample
    """
    token_dist = distributions.get('token', {})
    head_tokens = token_dist.get('head_tokens', set())

    sample_weights = []

    for image, caption in dataset:
        tokens = set(caption.lower().split())

        # Count how many head tokens in this sample
        head_token_count = len(tokens & head_tokens)
        total_token_count = len(tokens)

        # Samples with mostly head content get downsampled
        if total_token_count > 0:
            head_ratio = head_token_count / total_token_count
        else:
            head_ratio = 0.0

        # Downweight head-heavy samples, preserve tail
        if head_ratio > 0.8:  # Mostly head content
            weight = head_downsample_ratio
        elif head_ratio > 0.5:  # Mixed
            weight = 0.75
        else:  # Mostly tail content
            weight = 1.0

        sample_weights.append(weight)

    # Normalize to probabilities
    sample_weights = np.array(sample_weights)
    sample_weights = sample_weights / sample_weights.sum()

    return sample_weights

3. Diffusion-Based Visual Data Synthesis for Tail Categories

Use diffusion models to generate images for underrepresented tail entities:

def synthesize_tail_data(tail_entities, num_samples_per_entity,
                        text_generator, diffusion_model):
    """
    Generate synthetic images for tail entity categories.

    Args:
        tail_entities: list of underrepresented object/concept names
        num_samples_per_entity: how many images to generate per entity
        text_generator: LLM for generating diverse captions
        diffusion_model: DDPM for image generation

    Returns:
        list of (synthetic_image, caption) tuples
    """
    synthetic_data = []

    for entity in tail_entities:
        # Generate diverse text descriptions for this entity
        prompts = []
        for i in range(num_samples_per_entity):
            # Use LLM to generate varied descriptions
            prompt = text_generator(
                f"Generate a diverse visual description of a {entity} "
                f"variation {i+1}. Include scene context, colors, pose.",
                temperature=0.8
            )
            prompts.append(prompt)

        # Use diffusion model to generate images from text
        for prompt in prompts:
            # Add entity keyword to ensure relevance
            full_prompt = f"{entity}, {prompt}"

            # Generate with diffusion model
            image = diffusion_model.generate(
                prompt=full_prompt,
                num_steps=50,
                guidance_scale=7.5,
                seed=None  # Randomize for diversity
            )

            synthetic_data.append((image, full_prompt))

    return synthetic_data

Practical Guidance

When to Use ADR

Training vision-language models on diverse datasets with natural long-tail distributions
Need to improve tail category performance without adding more real data
Models like LLaVA where object/concept distribution is imbalanced
Want to improve generalization on rare visual concepts

When NOT to Use

Datasets already well-balanced across categories
Where synthetic data quality is critical (medical imaging, etc.)
When tail categories should remain underrepresented by design
Very small datasets where rebalancing statistics are unreliable

Hyperparameters & Configuration

Head threshold percentile: 80% (80/20 Pareto principle); adjust based on data skewness
Head downsample ratio: 0.5 (keep 50% of head samples); balance between coverage and tail emphasis
Synthesis samples per entity: 5-10 per tail entity; more diversity helps generalization
Entity detection model: YOLOv8 or similar for object detection
Diffusion steps: 50 steps for quality; 25 for speed
Guidance scale: 7.5 for text-image alignment; higher values (10+) strengthen entity presence

Common Pitfalls

Over-rebalancing: Completely removing head categories hurts overall accuracy; keep some head data
Poor caption quality for synthesis: LLM-generated captions must be descriptive; use temperature 0.7-0.8
Synthetic data mode collapse: Diffusion models may generate similar images; use seed randomization
Entity detection errors: Object detector mistakes propagate; validate detection on sample images
Ignoring statistical significance: Small tail categories may need higher synthesis count for reliability
Validation set bias: Ensure test set reflects original distribution, not rebalanced distribution

Reference

Kingma & Welling. 2014. Auto-Encoding Variational Bayes (VAE foundations).
Ho et al. 2020. Denoising Diffusion Probabilistic Models (DDPM).
Liu et al. 2019. Decoupling Representation and Classifier for Long-Tailed Recognition.
LLaVA project: https://github.com/haotian-liu/LLaVA
Project page: https://vlmlt.github.io/

Related Skills

ADu2021/flow-map-trajectory-tilting

testing

VerifiedTrustedCommunity

Uses flow maps as look-ahead operators to enable principled reward-guided diffusion by predicting trajectory endpoints at any denoising step. Deploy when applying rewards or preferences to diffusion trajectories with meaningful gradients throughout generation.

2SKILL.mdUpdated Apr 17, 2026

ADu2021/flow-map-trajectory-tilting

ADu2021/flexible-data-mixture-of-experts

testing

VerifiedTrustedCommunity

Train language models where each expert learns independently on closed datasets, enabling flexible inference with selective data inclusion or exclusion. 41% performance improvement while allowing users to opt out of specific data sources without retraining.

2SKILL.mdUpdated Apr 17, 2026

ADu2021/flexible-data-mixture-of-experts

ADu2021/flexibility-trap-diffusion-reasoning

data-ai

VerifiedTrustedCommunity

Understand how token generation flexibility in diffusion LMs paradoxically constrains reasoning, as models exploit ordering flexibility to avoid uncertain tokens, and apply simplified approaches that preserve parallel decoding benefits. Use when optimizing diffusion-based language models for reasoning tasks.

2SKILL.mdUpdated Apr 17, 2026

ADu2021/flexibility-trap-diffusion-reasoning

ADu2021/flex-continuous-agent-evolution

devops

VerifiedTrustedCommunity

Enable LLM agents to improve continuously during deployment by constructing structured experience libraries through self-reflection on successes and failures—achieving 23% improvement on reasoning without gradient-based parameter updates or external training.

2SKILL.mdUpdated Apr 17, 2026

ADu2021/flex-continuous-agent-evolution

Download

For Claude Desktop. Download once, then upload the file in the app — no terminal needed.

Need help? View full Cowork setup guide →

Install manually

Choose your platform

# Clone the repo
git clone https://github.com/ADu2021/skillXiv.git

# Copy into Claude Code skills folder (global)
cp -r skillXiv/skills/skillxiv-v0.0.2-claude-opus-4.6/adaptive-data-refinement-vlm-long-tail ~/.claude/skills/

Claude Code Skills — official skills path docs.

Repository

ADu2021/skillXiv

2 stars

Compatible with

Claude Code

OpenAI Codex CLI

ChatGPT