lunartech-x/automated-image-dataset-generation

Name: automated-image-dataset-generation
Author: lunartech-x

skills/data-and-science/research/scientific-skills/automated-image-dataset-generation/SKILL.md

npx skillsauth add lunartech-x/superpowers automated-image-dataset-generation

Automated Image Dataset Generation with LMMs

Overview

This skill provides a scalable, reusable framework for automatically generating labeled image datasets by combining web scraping with Large Multimodal Models (LMMs) for metadata generation. It targets a specific problem: manual data collection is resource-intensive, error-prone, and time-consuming.

Key Capabilities:

- Automated web image collection at scale (50,000+ images)
- LMM-powered metadata generation with ~95% accuracy
- Rule-based filtering for domain-specific categorization
- Structured output for object detection and classification tasks

When to Use This Skill

Use this skill when:

- You are building a custom image dataset for a machine learning application
- You need domain-specific images that existing datasets do not cover
- You need automated image labeling or metadata generation
- You are working on an object detection or image classification project
- Manual annotation is too expensive or time-consuming
- You need large-scale training data for computer vision models

Core Workflow

Phase 1: Query Design and Planning

Define Target Categories:
- Identify specific objects/classes to collect
- Create hierarchical category structure if needed
- Example categories: beams, columns, trusses, steel frames

Design Search Queries:

# Generate diverse search queries
categories = ["structural steel beam", "steel column construction", "roof truss"]
query_variations = [
    f"{cat} {mod}" 
    for cat in categories 
    for mod in ["photo", "site", "construction", "building"]
]

Set Collection Parameters:
- Target image count per category
- Image quality thresholds (resolution, format)
- Source diversity requirements
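
The collection parameters above can be gathered in one place so later phases read them from a single object. The following is a minimal sketch, not part of the original skill: `CollectionConfig`, its field names, the accepted formats, and the per-source cap are all hypothetical; only the 224x224 minimum and the 1000-image default echo values used elsewhere in this document.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class CollectionConfig:
    """Hypothetical container for the collection parameters listed above."""
    target_per_category: int = 1000                 # target image count per category
    min_resolution: Tuple[int, int] = (224, 224)    # (width, height) in pixels
    allowed_formats: Tuple[str, ...] = ("JPEG", "PNG")  # assumed format whitelist
    max_share_per_source: float = 0.5               # assumed cap on any single source

    def accepts(self, width: int, height: int, fmt: str) -> bool:
        """Return True if an image meets the resolution and format thresholds."""
        min_w, min_h = self.min_resolution
        return (
            width >= min_w
            and height >= min_h
            and fmt.upper() in self.allowed_formats
        )
```

A downloader could call `CollectionConfig().accepts(width, height, fmt)` before keeping a file.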

Phase 2: Web Scraping

Implement Multi-Source Scraping:

import requests
from bs4 import BeautifulSoup
from selenium import webdriver

def scrape_images(query, num_images=1000):
    """
    Scrape images from multiple sources:
    - Google Images
    - Bing Images
    - Domain-specific sites
    """
    images = []
    
    # Use appropriate rate limiting
    # Respect robots.txt
    # Store source URLs for attribution
    
    return images
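
The rate-limiting and robots.txt comments in the stub above can be sketched with the standard library alone. This is one possible shape under stated assumptions: `PoliteFetcher` and its method names are hypothetical, the robots.txt text is assumed to have been downloaded already, and the one-second default interval simply follows the 1-2 requests per second guideline given under Best Practices.

```python
import time
from urllib import robotparser
from urllib.parse import urlparse


class PoliteFetcher:
    """Hypothetical helper: checks robots.txt rules and spaces out requests."""

    def __init__(self, user_agent: str, min_interval: float = 1.0):
        self.user_agent = user_agent
        self.min_interval = min_interval   # seconds between requests
        self._parsers = {}                 # host -> RobotFileParser
        self._last_request = None          # monotonic time of the last request

    def load_rules(self, host: str, robots_lines: list) -> None:
        """Register already-downloaded robots.txt lines for a host."""
        parser = robotparser.RobotFileParser()
        parser.parse(robots_lines)
        self._parsers[host] = parser

    def allowed(self, url: str) -> bool:
        """Return True only if rules are known for the host and permit the URL."""
        parser = self._parsers.get(urlparse(url).netloc)
        return parser is not None and parser.can_fetch(self.user_agent, url)

    def wait_turn(self) -> None:
        """Sleep just long enough to keep min_interval between requests."""
        if self._last_request is not None:
            delay = self.min_interval - (time.monotonic() - self._last_request)
            if delay > 0:
                time.sleep(delay)
        self._last_request = time.monotonic()
```

Treating hosts with no loaded rules as disallowed is a deliberately conservative choice; a scraper would call `wait_turn()` and `allowed(url)` before each request.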

Image Download and Storage:

def download_images(image_urls, output_dir):
    """
    Download images with:
    - Duplicate detection (hash-based)
    - Format validation
    - Resolution filtering
    - Metadata preservation
    """
    pass

Initial Filtering:
- Remove corrupted/invalid images
- Filter by minimum resolution (e.g., 224x224)
- Deduplicate using perceptual hashing
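
To illustrate the perceptual-hashing step, here is a pure-Python sketch of an average hash. It assumes each image has already been decoded, converted to grayscale, and downscaled to a small grid of 0-255 values; the function names and the distance threshold are hypothetical. On real image files, a library such as `imagehash` (listed under Dependencies) would normally do this work.

```python
from typing import Dict, List, Sequence


def average_hash(gray: Sequence[Sequence[int]]) -> int:
    """Pack one bit per pixel: 1 if the pixel is brighter than the grid mean."""
    pixels = [p for row in gray for p in row]
    mean = sum(pixels) / len(pixels)
    bits = 0
    for p in pixels:
        bits = (bits << 1) | (1 if p > mean else 0)
    return bits


def hamming(a: int, b: int) -> int:
    """Count the bit positions in which two hashes differ."""
    return bin(a ^ b).count("1")


def dedupe(hashes: Dict[str, int], max_distance: int = 4) -> List[str]:
    """Keep the first image of every group whose hashes lie within max_distance."""
    kept: List[str] = []
    for name, value in hashes.items():
        if all(hamming(value, hashes[k]) > max_distance for k in kept):
            kept.append(name)
    return kept
```

Because the hash compares each pixel with the mean, a uniformly brightened copy of an image tends to produce the same bits, which is what makes it useful for catching near-duplicates that an exact file hash would miss.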

Phase 3: LMM-Based Metadata Generation

Configure LMM (Gemini Vision or equivalent):

import os

import google.generativeai as genai
import PIL.Image

# Read the API key from the environment rather than hard-coding it
genai.configure(api_key=os.environ['GOOGLE_API_KEY'])
model = genai.GenerativeModel('gemini-1.5-flash')

def generate_metadata(image_path, categories):
    """
    Use LMM to analyze image and generate metadata
    """
    image = PIL.Image.open(image_path)

    # Confidence is requested as a number so it can be compared with the
    # numeric threshold applied in Phase 4
    prompt = f"""
    Analyze this image and determine:
    1. Does it contain any of these objects: {categories}?
    2. If yes, which specific category?
    3. Confidence level as a number between 0 and 1
    4. Object location description (for detection tasks)
    5. Image quality assessment

    Return structured JSON response.
    """

    response = model.generate_content([prompt, image])
    # parse_response is a helper you supply: it converts the reply text to a dict
    return parse_response(response.text)
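
The `parse_response` helper called above is not defined in this skill. A minimal sketch follows, assuming the model replies with a JSON object that has `category` and `confidence` keys and sometimes wraps it in a Markdown code fence. The key names, the fallback record, and the label-to-score mapping are all assumptions made for illustration.

```python
import json

FENCE = "`" * 3  # Markdown code fence that models sometimes wrap JSON in

# Assumed mapping for replies that use a label instead of a number
LABEL_TO_SCORE = {"high": 0.9, "medium": 0.6, "low": 0.3}


def _failed() -> dict:
    """Record returned when the reply cannot be parsed."""
    return {"category": None, "confidence": 0.0, "parse_error": True}


def parse_response(text: str) -> dict:
    """Parse the model's reply into a dict with 'category' and 'confidence'."""
    cleaned = text.strip()
    if cleaned.startswith(FENCE):
        # Drop the opening fence line and anything from the closing fence on
        cleaned = cleaned.split("\n", 1)[1] if "\n" in cleaned else ""
        cleaned = cleaned.rsplit(FENCE, 1)[0]
    try:
        data = json.loads(cleaned)
    except json.JSONDecodeError:
        return _failed()
    if not isinstance(data, dict):
        return _failed()

    raw = data.get("confidence", 0.0)
    if isinstance(raw, str):
        score = LABEL_TO_SCORE.get(raw.strip().lower(), 0.0)
    elif isinstance(raw, (int, float)) and not isinstance(raw, bool):
        score = float(raw)
    else:
        score = 0.0
    data["confidence"] = min(max(score, 0.0), 1.0)  # clamp to [0, 1]
    data.setdefault("category", None)
    data["parse_error"] = False
    return data
```

Returning a zero-confidence record instead of raising keeps a long batch run alive; unparseable replies are then dropped by any confidence threshold and can be counted through the `parse_error` flag.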

Batch Processing:

def process_dataset(image_dir, categories, batch_size=100):
    """
    Process images in batches with:
    - Rate limiting
    - Error handling
    - Progress tracking
    - Checkpoint saving
    """
    results = []
    for batch in get_batches(image_dir, batch_size):
        batch_results = [
            generate_metadata(img, categories) 
            for img in batch
        ]
        results.extend(batch_results)
        save_checkpoint(results)
    return results
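
`get_batches` and `save_checkpoint` are used above but not defined. One way they might look, using only the standard library: the accepted file extensions and the default checkpoint file name are assumptions, and `chunked` is a hypothetical helper introduced here.

```python
import json
import os
from pathlib import Path
from typing import Iterator, List, Sequence

IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}  # assumed set of accepted extensions


def chunked(items: Sequence, size: int) -> Iterator[list]:
    """Yield consecutive slices of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def get_batches(image_dir: str, batch_size: int) -> Iterator[List[Path]]:
    """Yield batches of image paths from image_dir in a stable, sorted order."""
    paths = sorted(
        p for p in Path(image_dir).iterdir()
        if p.suffix.lower() in IMAGE_SUFFIXES
    )
    yield from chunked(paths, batch_size)


def save_checkpoint(results: list, path: str = "checkpoint.json") -> None:
    """Write to a temporary file, then swap it in, so a crash cannot truncate it."""
    tmp_path = f"{path}.tmp"
    with open(tmp_path, "w", encoding="utf-8") as handle:
        json.dump(results, handle)
    os.replace(tmp_path, path)
```

Sorting the paths keeps batch boundaries stable between runs, which makes it possible to resume from a checkpoint by skipping the batches already processed.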

Quality Metrics:
- Track LMM confidence scores
- Flag low-confidence predictions for review
- Calculate category distribution

Phase 4: Rule-Based Filtering

Apply Category Rules:

def filter_by_rules(metadata, rules):
    """
    Apply domain-specific rules:
    - Minimum confidence threshold (e.g., 0.8)
    - Category-specific validation
    - Cross-reference with search query
    """
    filtered = []
    for item in metadata:
        if item['confidence'] >= rules['min_confidence']:
            if validate_category(item, rules):
                filtered.append(item)
    return filtered
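
`validate_category` is left undefined above. A possible sketch of the category check and the cross-reference with the search query follows; the `allowed_categories` and `query_keywords` rule keys and the `category` and `query` item fields are assumptions, not part of the original skill.

```python
def validate_category(item: dict, rules: dict) -> bool:
    """Accept an item if its category is allowed and agrees with its search query."""
    category = item.get("category")
    if category not in rules.get("allowed_categories", ()):
        return False

    # Cross-reference: the query that found the image should mention the category
    keywords = rules.get("query_keywords", {}).get(category)
    if not keywords:
        return True  # no cross-reference rule defined for this category
    query = item.get("query", "").lower()
    return any(keyword.lower() in query for keyword in keywords)
```

For example, a rule set might allow `beam` and `truss` and require that a `beam` label came from a query containing the word "beam".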

Handle Edge Cases:
- Multi-label images (multiple categories)
- Ambiguous classifications
- Partial object visibility

Phase 5: Dataset Finalization

Generate Dataset Structure:

dataset/
├── images/
│   ├── category_1/
│   ├── category_2/
│   └── ...
├── annotations/
│   ├── metadata.json
│   └── labels.csv
├── splits/
│   ├── train.txt
│   ├── val.txt
│   └── test.txt
└── README.md

Create Annotation Files:

def create_annotations(filtered_data, output_dir):
    """
    Generate standard annotation formats:
    - COCO format (for object detection)
    - CSV with labels (for classification)
    - YOLO format (if needed)
    """
    pass
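
As an illustration of the simplest of these formats, the sketch below writes the classification CSV. The column names are hypothetical. COCO and YOLO detection output would additionally need bounding-box coordinates, which the Phase 3 prompt does not request (it asks only for a location description), so those formats are not attempted here.

```python
import csv
import io
from typing import Iterable, TextIO

# Assumed column layout for labels.csv
FIELDNAMES = ["file_name", "category", "confidence", "source_url"]


def write_labels_csv(filtered_data: Iterable[dict], handle: TextIO) -> int:
    """Write one row per image to an open text handle; return the row count."""
    writer = csv.DictWriter(handle, fieldnames=FIELDNAMES, lineterminator="\n")
    writer.writeheader()
    count = 0
    for item in filtered_data:
        # Keep only the known columns; missing values become empty strings
        writer.writerow({name: item.get(name, "") for name in FIELDNAMES})
        count += 1
    return count
```

Passing a handle rather than a path lets the same function write to a file opened with `newline=""` or to an `io.StringIO` buffer.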

Split Dataset:
- Train/Val/Test split (typically 70/15/15)
- Stratified splitting by category
- Ensure no data leakage
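
A stratified 70/15/15 split can be sketched with the standard library. `stratified_split` is a hypothetical helper that assumes each item is a dict with a string `category` key; `scikit-learn` (listed under Dependencies) provides an equivalent through `train_test_split` with its `stratify` argument.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence


def stratified_split(
    items: Sequence[dict],
    ratios: Sequence[float] = (0.70, 0.15, 0.15),
    seed: int = 0,
) -> Dict[str, List[dict]]:
    """Split items into train/val/test, keeping each category's proportions."""
    by_category = defaultdict(list)
    for item in items:
        by_category[item["category"]].append(item)

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    splits: Dict[str, List[dict]] = {"train": [], "val": [], "test": []}
    for category in sorted(by_category):
        group = by_category[category]
        rng.shuffle(group)
        n_train = round(len(group) * ratios[0])
        n_val = round(len(group) * ratios[1])
        splits["train"].extend(group[:n_train])
        splits["val"].extend(group[n_train:n_train + n_val])
        splits["test"].extend(group[n_train + n_val:])  # remainder goes to test
    return splits
```

Each item lands in exactly one split, but that alone does not prevent leakage: near-duplicate images should be removed before splitting, otherwise copies of the same picture can end up in both train and test.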

Best Practices

Web Scraping

- Respect rate limits: 1-2 requests per second
- Rotate user agents: Avoid detection
- Use proxies: For large-scale collection
- Cache responses: Avoid redundant downloads
- Store source URLs: For attribution and verification

LMM Usage

- Use appropriate prompts: Be specific about expected output format
- Batch processing: Optimize API costs
- Handle API errors: Implement retry logic with exponential backoff
- Validate responses: Parse and validate JSON responses
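
Retry with exponential backoff can be sketched as a small wrapper around any API call. `call_with_backoff` and its parameters are hypothetical; it catches every exception for brevity, whereas real code should catch only the transient errors that the chosen client library raises.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_backoff(
    func: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func, doubling the wait after each failure, up to max_attempts tries."""
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception:  # assumption: narrow this to transient errors in practice
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            delay = base_delay * (2 ** attempt)
            # Up to 10% random jitter so parallel workers do not retry in lockstep
            sleep(delay + random.uniform(0, delay * 0.1))
    raise ValueError("max_attempts must be at least 1")
```

A call such as `call_with_backoff(lambda: generate_metadata(path, categories))` would then wait roughly 1, 2, 4, and 8 seconds between its five attempts.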

Data Quality

- Verify sample manually: Check 100-200 random samples
- Calculate inter-annotator agreement: If using multiple LMMs
- Document accuracy metrics: Report precision/recall per category
- Version your dataset: Track changes over time
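
Per-category precision and recall can be computed directly from the manually verified sample. The sketch below assumes the review yields (predicted, actual) label pairs; `per_category_metrics` is a hypothetical helper name.

```python
from typing import Dict, Iterable, Tuple


def per_category_metrics(
    pairs: Iterable[Tuple[str, str]],
) -> Dict[str, Dict[str, float]]:
    """Compute precision and recall per category from (predicted, actual) pairs."""
    pairs = list(pairs)
    categories = {p for p, _ in pairs} | {a for _, a in pairs}
    report = {}
    for cat in sorted(categories):
        tp = sum(1 for p, a in pairs if p == cat and a == cat)
        fp = sum(1 for p, a in pairs if p == cat and a != cat)
        fn = sum(1 for p, a in pairs if p != cat and a == cat)
        report[cat] = {
            # Report 0.0 when a category has no predictions or no true examples
            "precision": tp / (tp + fp) if tp + fp else 0.0,
            "recall": tp / (tp + fn) if tp + fn else 0.0,
        }
    return report
```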

Legal & Ethical

- Check image licenses: Prefer CC-licensed content
- Respect robots.txt: Don't scrape disallowed pages
- Attribute sources: Maintain source URLs
- Consider privacy: Filter personal/sensitive content

Expected Results

Based on the original research:

- Collection scale: 50,000+ raw images
- After filtering: ~5% relevant images (domain-specific)
- Metadata accuracy: 94.8%
- Categories: Successfully identifies 5+ distinct categories

Integration with Other Skills

- scientific-schematics: Generate dataset visualization diagrams
- exploratory-data-analysis: Analyze dataset statistics
- pytorch: Train models on generated dataset
- matplotlib/seaborn: Visualize class distributions

Dependencies

# Core
pip install requests beautifulsoup4 selenium pillow

# LMM
pip install google-generativeai  # or openai for GPT-4V

# Image processing
pip install imagehash opencv-python

# Dataset tools
pip install pandas scikit-learn

References

Gharib, S., & Moselhi, O. (2025). Automated Image Dataset Generation Using Web Scraping and Large Multimodal Models for Construction Applications. ISARC 2025.

Generate large-scale image datasets automatically using web scraping and Large Multimodal Models (LMMs) like Gemini Vision. This skill implements the methodology from the research paper "Automated Image Dataset Generation Using Web Scraping and Large Multimodal Models for Construction Applications" by Gharib & Moselhi. Achieves ~95% accuracy in metadata generation for image classification and object detection tasks.

13 stars

development

Updated Apr 6, 2026

Install globally:

npx skillsauth add lunartech-x/superpowers automated-image-dataset-generation

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean. Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

- Trivy (container and dependency vulnerability scanner): Clean, 95%
- Semgrep (static code analysis for vulnerabilities): Clean, 95%
- mcp-scan (Snyk) (Model Context Protocol security validation): Clean, 95%
- Snyk (dep) (open source security scanning): Skipped, 50%
- Socket.dev (supply chain security analysis): Skipped, 50%
- VirusTotal (multi-engine malware detection): Skipped, 50%
- CrowdStrike (advanced threat intelligence): Skipped, 50%
- OSV-Scanner (Open Source Vulnerability database check): Skipped, 50%
- OWASP Dep-Check: Skipped, 50%

Last scanned: Apr 6, 2026, 10:04 PM (73.2s, 1 file scanned)

SKILL.md

name: automated-image-dataset-generation
description: Generate large-scale image datasets automatically using web scraping and Large Multimodal Models (LMMs) like Gemini Vision. This skill implements the methodology from the research paper "Automated Image Dataset Generation Using Web Scraping and Large Multimodal Models for Construction Applications" by Gharib & Moselhi. Achieves ~95% accuracy in metadata generation for image classification and object detection tasks.
allowed-tools: [Read, Write, Edit, Bash, WebFetch, Browser]
license: MIT license
skill-author: Adapted from Gharib & Moselhi (ISARC 2025)
paper-source: Automated Image Dataset Generation Using Web Scraping and LMMs for Construction Applications

Install manually

# Clone the repo
git clone https://github.com/lunartech-x/superpowers.git

# Copy into Claude Code skills folder (global); create the folder first if it does not exist
mkdir -p ~/.claude/skills
cp -r superpowers/skills/data-and-science/research/scientific-skills/automated-image-dataset-generation ~/.claude/skills/

Repository: lunartech-x/superpowers (13 stars)

Compatible with: Claude Code, OpenAI Codex CLI, ChatGPT