Generate large-scale image datasets automatically using web scraping and Large Multimodal Models (LMMs) such as Gemini Vision. This skill implements the methodology from the research paper "Automated Image Dataset Generation Using Web Scraping and Large Multimodal Models for Construction Applications" by Gharib & Moselhi. The approach achieves ~95% accuracy in metadata generation for image classification and object detection tasks.
npx skillsauth add lunartech-x/superpowers automated-image-dataset-generation
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
This skill provides a scalable, reusable framework for automatically generating labeled image datasets by combining web scraping with Large Multimodal Models (LMMs) for metadata generation. It targets the cost of manual data collection, which is resource-intensive, error-prone, and time-consuming.
Key Capabilities:
Use this skill when:
Define Target Categories:
Design Search Queries:
# Generate diverse search queries
categories = ["structural steel beam", "steel column construction", "roof truss"]
query_variations = [
    f"{cat} {mod}"
    for cat in categories
    for mod in ["photo", "site", "construction", "building"]
]
Set Collection Parameters:
Implement Multi-Source Scraping:
import requests
from bs4 import BeautifulSoup
from selenium import webdriver

def scrape_images(query, num_images=1000):
    """
    Scrape images from multiple sources:
    - Google Images
    - Bing Images
    - Domain-specific sites
    """
    images = []
    # Use appropriate rate limiting
    # Respect robots.txt
    # Store source URLs for attribution
    return images
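The three comments in the stub above (rate limiting, robots.txt, source attribution) can be sketched with the standard library alone. This is an illustrative sketch, not the paper's scraper: `build_robot_parser`, `RateLimiter`, `filter_allowed`, and the `dataset-bot` user agent are hypothetical names, and the robots.txt text is assumed to have been fetched already.

```python
import time
import urllib.robotparser


def build_robot_parser(robots_txt: str) -> urllib.robotparser.RobotFileParser:
    """Parse robots.txt content that has already been downloaded."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


class RateLimiter:
    """Enforce a minimum delay between consecutive requests."""

    def __init__(self, min_interval_s: float):
        self.min_interval_s = min_interval_s
        self._last_call = None  # monotonic timestamp of the previous call

    def wait(self) -> float:
        """Sleep if the previous call was too recent; return seconds slept."""
        slept = 0.0
        if self._last_call is not None:
            remaining = self.min_interval_s - (time.monotonic() - self._last_call)
            if remaining > 0:
                time.sleep(remaining)
                slept = remaining
        self._last_call = time.monotonic()
        return slept


def filter_allowed(urls, parser, user_agent="dataset-bot"):
    """Keep only URLs robots.txt permits, retaining each source URL for attribution."""
    return [{"source_url": u} for u in urls if parser.can_fetch(user_agent, u)]
```

A scraper would call `limiter.wait()` before each request and pass candidate URLs through `filter_allowed` before downloading anything.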
Image Download and Storage:
def download_images(image_urls, output_dir):
    """
    Download images with:
    - Duplicate detection (hash-based)
    - Format validation
    - Resolution filtering
    - Metadata preservation
    """
    pass
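Two of the checks listed in the docstring above, duplicate detection and format validation, can be illustrated on raw bytes. The sketch below is an assumption-laden simplification: it uses exact-byte SHA-256 hashing (which only catches identical files, unlike perceptual hashing with `imagehash`) and a small table of magic bytes. `MAGIC_BYTES`, `detect_format`, and `deduplicate` are hypothetical names.

```python
import hashlib

# Leading "magic" bytes of a few common image formats (not exhaustive).
MAGIC_BYTES = {
    "jpeg": b"\xff\xd8\xff",
    "png": b"\x89PNG\r\n\x1a\n",
    "gif": b"GIF8",
}


def detect_format(data: bytes):
    """Return the format name if the bytes start with a known signature, else None."""
    for name, magic in MAGIC_BYTES.items():
        if data.startswith(magic):
            return name
    return None


def deduplicate(blobs):
    """Return (digest, format, data) for the first occurrence of each valid image."""
    seen = set()
    kept = []
    for data in blobs:
        fmt = detect_format(data)
        if fmt is None:
            continue  # not a recognized image format
        digest = hashlib.sha256(data).hexdigest()
        if digest in seen:
            continue  # byte-identical duplicate
        seen.add(digest)
        kept.append((digest, fmt, data))
    return kept
```

Resolution filtering would need an image library such as Pillow to read dimensions, so it is left out of this sketch.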
Initial Filtering:
Configure LMM (Gemini Vision or equivalent):
import os

import google.generativeai as genai
import PIL.Image

genai.configure(api_key=os.environ['GOOGLE_API_KEY'])
model = genai.GenerativeModel('gemini-1.5-flash')

def generate_metadata(image_path, categories):
    """
    Use LMM to analyze image and generate metadata
    """
    image = PIL.Image.open(image_path)
    prompt = f"""
    Analyze this image and determine:
    1. Does it contain any of these objects: {categories}?
    2. If yes, which specific category?
    3. Confidence level (high/medium/low)
    4. Object location description (for detection tasks)
    5. Image quality assessment
    Return structured JSON response.
    """
    response = model.generate_content([prompt, image])
    # parse_response is not defined in this snippet; supply a helper
    # that turns the model's JSON reply into a dict.
    return parse_response(response.text)
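The snippet above calls `parse_response` without defining it. One possible sketch follows, assuming the model returns a single JSON object that may be wrapped in a Markdown code fence or surrounded by prose. The error-dict shape is an illustrative choice, not part of the original skill.

```python
import json
import re

FENCE = "`" * 3  # a Markdown code fence, built indirectly to keep this block readable


def parse_response(text: str) -> dict:
    """Extract the first JSON object from an LMM reply, or return an error dict."""
    # Prefer the contents of a fenced block if the model used one.
    fenced = re.search(FENCE + r"(?:json)?\s*(.*?)" + FENCE, text, re.DOTALL)
    candidate = fenced.group(1) if fenced else text

    # Fall back to the outermost braces to skip any surrounding prose.
    start = candidate.find("{")
    end = candidate.rfind("}")
    if start == -1 or end <= start:
        return {"error": "no JSON object found", "raw": text}
    try:
        return json.loads(candidate[start:end + 1])
    except json.JSONDecodeError as exc:
        return {"error": str(exc), "raw": text}
```

Returning an error dict instead of raising lets a batch run continue and record which images need a retry.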
Batch Processing:
def process_dataset(image_dir, categories, batch_size=100):
    """
    Process images in batches with:
    - Rate limiting
    - Error handling
    - Progress tracking
    - Checkpoint saving
    """
    # get_batches and save_checkpoint are helpers you must supply;
    # rate limiting and error handling are not implemented in this outline.
    results = []
    for batch in get_batches(image_dir, batch_size):
        batch_results = [
            generate_metadata(img, categories)
            for img in batch
        ]
        results.extend(batch_results)
        save_checkpoint(results)
    return results
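The batching loop above relies on `get_batches` and `save_checkpoint`, which the skill does not define. A minimal sketch of both, plus a hypothetical `load_checkpoint` for resuming, might look like this. The file suffix set and the default `checkpoint.json` path are assumptions.

```python
import json
import os
from pathlib import Path

IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".webp"}  # assumed set of accepted formats


def get_batches(image_dir, batch_size):
    """Yield lists of image paths from image_dir, batch_size at a time, in sorted order."""
    paths = sorted(
        p for p in Path(image_dir).iterdir() if p.suffix.lower() in IMAGE_SUFFIXES
    )
    for i in range(0, len(paths), batch_size):
        yield paths[i:i + batch_size]


def save_checkpoint(results, path="checkpoint.json"):
    """Write results to a temp file, then rename, so a crash never leaves a partial file."""
    tmp = f"{path}.tmp"
    with open(tmp, "w", encoding="utf-8") as fh:
        json.dump(results, fh)
    os.replace(tmp, path)


def load_checkpoint(path="checkpoint.json"):
    """Return previously saved results, or an empty list if no checkpoint exists."""
    if not os.path.exists(path):
        return []
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)
```

Results must be JSON-serializable for this to work, which holds if each item is the dict parsed from the model's reply.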
Quality Metrics:
Apply Category Rules:
def filter_by_rules(metadata, rules):
    """
    Apply domain-specific rules:
    - Minimum confidence threshold (e.g., 0.8)
    - Category-specific validation
    - Cross-reference with search query
    """
    filtered = []
    for item in metadata:
        # item['confidence'] must be numeric here. The LMM prompt asks for
        # high/medium/low, so map those labels to scores before filtering.
        if item['confidence'] >= rules['min_confidence'] and validate_category(item, rules):
            filtered.append(item)
    return filtered
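The LMM prompt asks for a confidence level of high/medium/low, while the rule above compares against a numeric threshold such as 0.8, and `validate_category` is left undefined. The sketch below shows one way to bridge both gaps. The score values in `CONFIDENCE_SCORES`, the `allowed_categories` rule key, and the `search_query` item key are all assumptions made for illustration.

```python
# Assumed mapping from the prompt's labels to numeric scores; tune to your data.
CONFIDENCE_SCORES = {"high": 0.9, "medium": 0.6, "low": 0.3}


def to_score(confidence) -> float:
    """Convert a numeric or high/medium/low confidence into a float score."""
    if isinstance(confidence, (int, float)):
        return float(confidence)
    # Unknown labels get 0.0 so they fail any positive threshold.
    return CONFIDENCE_SCORES.get(str(confidence).strip().lower(), 0.0)


def validate_category(item, rules) -> bool:
    """Accept an item if its category is allowed and agrees with its search query."""
    category = item.get("category")
    if category not in rules.get("allowed_categories", []):
        return False
    query = item.get("search_query")
    # Cross-reference: when the originating query is known, it should mention the category.
    return query is None or category.lower() in query.lower()
```

With these helpers, the threshold check becomes `to_score(item['confidence']) >= rules['min_confidence']`.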
Handle Edge Cases:
Generate Dataset Structure:
dataset/
├── images/
│   ├── category_1/
│   ├── category_2/
│   └── ...
├── annotations/
│   ├── metadata.json
│   └── labels.csv
├── splits/
│   ├── train.txt
│   ├── val.txt
│   └── test.txt
└── README.md
Create Annotation Files:
def create_annotations(filtered_data, output_dir):
    """
    Generate standard annotation formats:
    - COCO format (for object detection)
    - CSV with labels (for classification)
    - YOLO format (if needed)
    """
    pass
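The stub above leaves the writers unimplemented. As a partial sketch covering only the classification case, the function below writes `metadata.json` and `labels.csv` into the `annotations/` folder shown in the dataset layout. It assumes each item is a dict with `image` and `category` keys, which the original does not specify; COCO and YOLO export are not covered.

```python
import csv
import json
from pathlib import Path


def create_annotations(filtered_data, output_dir):
    """Write metadata.json and labels.csv under <output_dir>/annotations."""
    out = Path(output_dir) / "annotations"
    out.mkdir(parents=True, exist_ok=True)

    # Full metadata, one dict per image, for later inspection or conversion.
    with open(out / "metadata.json", "w", encoding="utf-8") as fh:
        json.dump(filtered_data, fh, indent=2)

    # Flat image/label table for classification training.
    with open(out / "labels.csv", "w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        writer.writerow(["image", "category"])
        for item in filtered_data:
            writer.writerow([item["image"], item["category"]])
    return out
```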
Split Dataset:
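A minimal sketch of the split step, producing the train/val/test lists that would go into `splits/`. The 70/15/15 ratios and the fixed seed are assumptions for illustration only, since the source does not state the ratios it uses; `split_dataset` is a hypothetical name.

```python
import random


def split_dataset(image_names, ratios=(0.7, 0.15, 0.15), seed=42):
    """Shuffle deterministically and split names into train/val/test lists."""
    names = sorted(image_names)  # sort first so the result depends only on the seed
    random.Random(seed).shuffle(names)
    n_train = int(len(names) * ratios[0])
    n_val = int(len(names) * ratios[1])
    return {
        "train": names[:n_train],
        "val": names[n_train:n_train + n_val],
        "test": names[n_train + n_val:],  # remainder, so no image is dropped
    }
```

For imbalanced categories, a stratified split (for example with scikit-learn's `train_test_split` and its `stratify` argument) may be a better fit than this plain shuffle.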
Based on the original research:
# Core
pip install requests beautifulsoup4 selenium pillow
# LMM
pip install google-generativeai # or openai for GPT-4V
# Image processing
pip install imagehash opencv-python
# Dataset tools
pip install pandas scikit-learn