skills/data-and-science/research/scientific-skills/textual-metadata-dataset-construction/SKILL.md
Automatically construct large-scale image datasets from web sources using multiple textual metadata for semantic expansion and CNN-based filtering. This skill implements the methodology from "Automatic Image Dataset Construction with Multiple Textual Metadata" (IEEE ICME 2016). Reduces dataset bias and improves cross-dataset generalization through query expansion and progressive filtering.
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
npx skillsauth add lunartech-x/superpowers textual-metadata-dataset-construction
This skill provides a framework for automatically collecting diverse, high-quality image datasets from the web using semantic query expansion and progressive CNN-based filtering. The methodology targets two problems in web-collected data: dataset bias and weak cross-dataset generalization.
Key Innovation: Uses Google Books Ngrams Corpora for query expansion to capture richer semantic descriptions, then progressively filters using CNNs.
Use this skill when:
Initial Query Definition:
initial_queries = ["dog", "car", "airplane"]
Semantic Expansion with N-gram Corpora:
def expand_query_with_ngrams(query, ngram_data):
    """
    Expand query using Google Books Ngrams:
    - Find co-occurring terms
    - Add synonyms and related concepts
    - Include descriptive modifiers

    Example: "dog" → ["dog breed", "puppy", "canine",
                      "dog playing", "dog running", ...]
    """
    # Get bigrams containing the query
    bigrams = get_ngrams(query, n=2, ngram_data=ngram_data)
    # Get trigrams for context
    trigrams = get_ngrams(query, n=3, ngram_data=ngram_data)
    # Combine and rank by frequency
    expansions = rank_by_relevance(bigrams + trigrams)
    return expansions
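The helpers `get_ngrams` and `rank_by_relevance` are not defined in this skill. A minimal sketch of what they might look like follows, assuming the n-gram data has already been loaded into a plain dict that maps each n-gram string to a corpus count. The real Google Books Ngrams files are far larger and need their own parsing; the toy counts and the `top_k` parameter are invented for illustration, and raw frequency is only an assumed stand-in for relevance.

```python
from typing import Dict, List, Tuple


def get_ngrams(query: str, n: int, ngram_data: Dict[str, int]) -> List[Tuple[str, int]]:
    """Return (ngram, count) pairs of length n that contain the query as a whole token."""
    matches = []
    for ngram, count in ngram_data.items():
        tokens = ngram.lower().split()
        if len(tokens) == n and query.lower() in tokens:
            matches.append((ngram, count))
    return matches


def rank_by_relevance(candidates: List[Tuple[str, int]], top_k: int = 5) -> List[str]:
    """Rank candidates by raw corpus frequency (assumed proxy for relevance)."""
    ranked = sorted(candidates, key=lambda pair: pair[1], reverse=True)
    return [ngram for ngram, _ in ranked[:top_k]]


# Toy counts, invented for illustration only.
toy_ngrams = {
    "dog breed": 900,
    "hot dog": 1200,
    "dog running": 300,
    "small dog breed": 150,
    "cat food": 800,
}

candidates = get_ngrams("dog", 2, toy_ngrams) + get_ngrams("dog", 3, toy_ngrams)
expanded = rank_by_relevance(candidates)
print(expanded)
# ['hot dog', 'dog breed', 'dog running', 'small dog breed']
```

In this toy data "hot dog" ranks first, which illustrates why frequency-ranked expansions still need the filtering steps that follow.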
Visual Saliency Filtering:
def filter_expansions(expansions, visual_model, threshold=0.5):
    """
    Remove expansions that are:
    - Visually non-salient (abstract concepts)
    - Less relevant to visual domain
    - Too generic or too specific

    Use pre-trained visual model to score saliency.
    The default threshold is a placeholder; tune it for your data.
    """
    filtered = []
    for exp in expansions:
        # Check if expansion corresponds to visually identifiable concept
        saliency_score = visual_model.predict_saliency(exp)
        if saliency_score > threshold:
            filtered.append(exp)
    return filtered
Multi-Query Image Collection:
def collect_images(expanded_queries, search_engine, images_per_query=500):
    """
    Retrieve images using expanded queries:
    - Use multiple search engines
    - Collect metadata (source URL, query used)
    - Diversify sources to reduce bias

    search_engine is any client object exposing image_search(query, num_results);
    call this function once per engine to combine sources.
    """
    all_images = []
    for query in expanded_queries:
        images = search_engine.image_search(
            query,
            num_results=images_per_query
        )
        # Record which expanded query produced each image
        for img in images:
            img['source_query'] = query
        all_images.extend(images)
    return all_images
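The docstring above mentions multiple search engines and source diversification. One way to sketch that, assuming each engine is wrapped as a callable taking `(query, num_results)` and returning dicts with a `url` key, is shown below. The `SearchFn` type, the stub engines, and the `example` URLs are all hypothetical and stand in for real search clients.

```python
from typing import Callable, Dict, List

# Hypothetical engine interface: (query, num_results) -> list of {"url": ...}
SearchFn = Callable[[str, int], List[Dict[str, str]]]


def collect_from_engines(queries: List[str], engines: Dict[str, SearchFn],
                         images_per_query: int = 500) -> List[Dict[str, str]]:
    """Query every engine with every expansion and drop repeated URLs."""
    seen_urls = set()
    collected = []
    for query in queries:
        for engine_name, search in engines.items():
            for img in search(query, images_per_query):
                if img["url"] in seen_urls:
                    continue  # same image already returned by another query or engine
                seen_urls.add(img["url"])
                collected.append({**img, "source_query": query, "engine": engine_name})
    return collected


def fake_engine(prefix: str) -> SearchFn:
    """Stub engine that fabricates URLs; a real client would call a search API."""
    def search(query: str, num_results: int) -> List[Dict[str, str]]:
        slug = query.replace(" ", "-")
        return [{"url": f"https://{prefix}.example/{slug}/{i}.jpg"} for i in range(num_results)]
    return search


engines = {
    "a": fake_engine("a"),
    "b": fake_engine("b"),
    "mirror": fake_engine("a"),  # returns the same URLs as engine "a"
}
images = collect_from_engines(["dog breed", "puppy"], engines, images_per_query=2)
print(len(images))
# 8
```

Recording both the query and the engine for each image keeps enough metadata to audit where any later bias comes from.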
Initial Preprocessing:
def preprocess_images(images):
    """
    - Remove duplicates (perceptual hash)
    - Validate image format
    - Resize to standard dimensions
    - Remove corrupted files
    """
    pass
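The preprocessing stub leaves duplicate removal unspecified beyond "perceptual hash". As an illustration, the sketch below uses average hash (aHash), one simple perceptual-hash variant chosen here as an assumption rather than taken from the source. It also assumes each image has already been decoded to grayscale and downscaled to a small pixel grid (in practice something like 8x8, produced with an imaging library); the `max_distance` parameter and the 2x2 toy grids are illustrative only.

```python
from typing import List, Sequence

Grid = Sequence[Sequence[int]]  # grayscale pixel rows, values 0-255


def average_hash(grid: Grid) -> int:
    """Average hash: one bit per pixel, set when the pixel is above the grid mean."""
    pixels = [p for row in grid for p in row]
    mean = sum(pixels) / len(pixels)
    bits = 0
    for p in pixels:
        bits = (bits << 1) | (1 if p > mean else 0)
    return bits


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")


def deduplicate(grids: List[Grid], max_distance: int = 2) -> List[int]:
    """Return indices of grids kept after dropping near-duplicates of earlier ones."""
    kept: List[int] = []
    hashes: List[int] = []
    for idx, grid in enumerate(grids):
        h = average_hash(grid)
        # Keep the grid only if it differs enough from everything kept so far
        if all(hamming(h, seen) > max_distance for seen in hashes):
            kept.append(idx)
            hashes.append(h)
    return kept


# Tiny 2x2 grids stand in for downscaled images.
a = [[10, 200], [220, 15]]
b = [[12, 198], [225, 11]]  # near-copy of a with small pixel changes
c = [[200, 10], [15, 220]]  # inverted pattern

kept_indices = deduplicate([a, b, c], max_distance=0)
print(kept_indices)
# [0, 2]
```

The pairwise comparison is quadratic in the number of kept images, so a large crawl would need bucketing or an index over the hashes.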
Feature Extraction:
import numpy as np

def extract_features(images, cnn_model):
    """
    Extract deep features using pre-trained CNN
    (e.g., VGG, ResNet features from penultimate layer)
    """
    features = []
    for img in images:
        feat = cnn_model.extract_features(img)
        features.append(feat)
    # Stack into a (num_images, feature_dim) array
    return np.array(features)
Cluster Analysis:
from sklearn.cluster import KMeans

def cluster_and_filter(features, images, n_clusters=10,
                       min_density=0.05, min_coherence=0.5):
    """
    Cluster images by visual similarity:
    - Identify core clusters (likely relevant)
    - Remove outlier clusters (likely noise)
    - Keep images from dense, coherent clusters

    The min_density and min_coherence defaults are placeholders; tune per dataset.
    """
    kmeans = KMeans(n_clusters=n_clusters)
    clusters = kmeans.fit_predict(features)
    # Analyze cluster statistics; analyze_clusters is assumed to return one
    # dict per cluster with 'id', 'density', and 'coherence' keys
    cluster_stats = analyze_clusters(clusters, features)
    # Remove outlier clusters (low density, high variance) by keeping
    # only the ids of clusters that pass both checks
    valid_clusters = {
        c['id'] for c in cluster_stats
        if c['density'] > min_density and c['coherence'] > min_coherence
    }
    filtered_images = [
        img for img, c in zip(images, clusters)
        if c in valid_clusters
    ]
    return filtered_images
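`analyze_clusters` is not defined in this skill, and the source does not say how density and coherence are measured. The sketch below is one possible reading, under two stated assumptions: density is the share of all images that fall in the cluster, and coherence is `1 / (1 + mean distance to the centroid)`. It uses only the standard library so it can be run without a trained model; the feature vectors and thresholds are toy values.

```python
import math
from collections import defaultdict
from typing import Dict, List, Sequence


def analyze_clusters(labels: Sequence[int], features: Sequence[Sequence[float]]) -> List[Dict]:
    """Per-cluster statistics under assumed definitions of density and coherence."""
    members = defaultdict(list)
    for label, feat in zip(labels, features):
        members[label].append(feat)

    stats = []
    total = len(labels)
    for label, feats in sorted(members.items()):
        dim = len(feats[0])
        centroid = [sum(f[d] for f in feats) / len(feats) for d in range(dim)]
        mean_dist = sum(math.dist(f, centroid) for f in feats) / len(feats)
        stats.append({
            "id": label,
            "density": len(feats) / total,         # share of all images
            "coherence": 1.0 / (1.0 + mean_dist),  # 1.0 means all points coincide
        })
    return stats


labels = [0, 0, 0, 1, 1, 2]
features = [
    [0.0, 0.0], [0.1, 0.0], [0.0, 0.1],  # tight cluster
    [5.0, 5.0], [9.0, 9.0],              # loose cluster
    [20.0, 0.0],                         # singleton
]
stats = analyze_clusters(labels, features)
valid_ids = {s["id"] for s in stats if s["density"] > 0.2 and s["coherence"] > 0.5}
kept = [i for i, label in enumerate(labels) if label in valid_ids]
print(valid_ids, kept)
# {0} [0, 1, 2]
```

The loose cluster fails the coherence check and the singleton fails the density check, so only the tight cluster survives. Note that a singleton is perfectly coherent by this definition, which is why both checks are applied together.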
Initial CNN Training:
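def train_initial_classifier(clustered_images, num_classes):
    """
    Train initial CNN classifier on clustered data:
    - Use cluster assignments as pseudo-labels
    - Fine-tune pre-trained model
    """
    # load_pretrained_cnn and fine_tune are placeholder helpers;
    # num_classes is assumed to size the classification head
    model = load_pretrained_cnn(num_classes=num_classes)
    model = fine_tune(model, clustered_images)
    return model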
Progressive Refinement:
def progressive_filtering(images, model, iterations=3):
    """
    Iteratively refine dataset:
    1. Classify all images with current model
    2. Remove low-confidence predictions
    3. Retrain model on refined set
    4. Repeat
    """
    for i in range(iterations):
        # Predict on all images
        predictions = model.predict(images)
        # Filter by confidence
        confident_samples = [
            (img, pred) for img, pred in zip(images, predictions)
            if pred['confidence'] > confidence_threshold(i)
        ]
        # Retrain on refined set
        model = train_classifier(confident_samples)
        images = [s[0] for s in confident_samples]
    return images, model
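The loop above depends on `confidence_threshold(i)` and `train_classifier`, neither of which is defined in this skill. To make the control flow concrete, here is a runnable toy version. Everything in it is an assumption made for illustration: the linear threshold schedule, the 1-D features, and the class-mean "model" that stands in for a CNN.

```python
import math
from typing import Dict, List, Tuple

Sample = Tuple[float, int]  # (1-D feature, pseudo-label)


def confidence_threshold(iteration: int, start: float = 0.6,
                         step: float = 0.1, cap: float = 0.9) -> float:
    """Assumed schedule: the bar rises linearly each round, up to a cap."""
    return min(cap, start + step * iteration)


def train_classifier(samples: List[Sample]) -> Dict[int, float]:
    """Toy 'model': the mean feature value of each class."""
    by_class: Dict[int, List[float]] = {}
    for x, y in samples:
        by_class.setdefault(y, []).append(x)
    return {y: sum(xs) / len(xs) for y, xs in by_class.items()}


def predict_confidence(model: Dict[int, float], x: float, label: int) -> float:
    """Softmax over negative distances to class means; probability of the given label."""
    weights = {y: math.exp(-abs(x - mean)) for y, mean in model.items()}
    return weights[label] / sum(weights.values())


def progressive_filtering(samples: List[Sample], iterations: int = 3) -> List[Sample]:
    """Classify, drop low-confidence samples, retrain, repeat."""
    model = train_classifier(samples)
    for i in range(iterations):
        threshold = confidence_threshold(i)
        samples = [s for s in samples
                   if predict_confidence(model, s[0], s[1]) > threshold]
        model = train_classifier(samples)
    return samples


noisy = [
    (0.0, 0), (0.5, 0), (1.0, 0),
    (9.5, 0),                      # mislabeled: sits with class 1
    (9.0, 1), (10.0, 1), (11.0, 1),
]
refined = progressive_filtering(noisy)
print(refined)
```

The mislabeled sample is removed in the first round, after which the class-0 mean moves back toward its true cluster. Raising the threshold across rounds is one plausible design, since a retrained model on cleaner data can be held to a stricter bar.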
Quality Verification:
def verify_dataset_quality(dataset, test_sets):
    """
    Evaluate dataset quality:
    - Cross-dataset generalization (test on STL-10, CIFAR-10)
    - Class balance analysis
    - Diversity metrics

    test_sets maps a name to an external test set,
    e.g. {"stl10": stl10_test, "cifar10": cifar10_test}.
    """
    # Train classifier on generated dataset
    model = train_classifier(dataset)
    # Test on external datasets
    accuracies = {name: evaluate(model, data) for name, data in test_sets.items()}
    return {
        'cross_dataset_acc': sum(accuracies.values()) / len(accuracies),
        'class_balance': compute_balance(dataset),
        'diversity': compute_diversity(dataset)
    }
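`compute_balance` is not defined in this skill. One common way to score class balance, used here as an assumption rather than as the source's metric, is the normalized entropy of the label distribution: 1.0 when every class has the same number of images, approaching 0.0 as one class dominates.

```python
import math
from collections import Counter
from typing import Sequence


def compute_balance(labels: Sequence[str]) -> float:
    """Normalized entropy of the label distribution (assumed balance metric)."""
    counts = Counter(labels)
    if len(counts) < 2:
        # Balance is not meaningful with fewer than two classes;
        # 0.0 is returned here by convention.
        return 0.0
    total = sum(counts.values())
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return entropy / math.log(len(counts))


balanced = compute_balance(["dog"] * 50 + ["car"] * 50)
skewed = compute_balance(["dog"] * 90 + ["car"] * 10)
print(round(balanced, 3), round(skewed, 3))
```

This version takes a flat list of class labels, so a dataset object would first need to be reduced to its labels.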
Export Dataset:
dataset/
├── train/
│   ├── class_1/
│   ├── class_2/
│   └── ...
├── val/
├── test/
├── metadata.json
└── dataset_stats.md
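A short sketch of creating this layout with the standard library follows. The source shows class folders only under `train/`; placing them under `val/` and `test/` as well is an assumption, as are the `metadata.json` fields and the table written to `dataset_stats.md`. The function name `export_layout` is hypothetical.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, List


def export_layout(root: Path, classes: List[str], counts: Dict[str, int]) -> None:
    """Create split/class folders plus metadata.json and dataset_stats.md."""
    splits = ["train", "val", "test"]
    for split in splits:
        for cls in classes:
            (root / split / cls).mkdir(parents=True, exist_ok=True)

    # Assumed metadata schema: class names and split names only
    (root / "metadata.json").write_text(
        json.dumps({"classes": classes, "splits": splits}, indent=2)
    )

    # Assumed stats format: a small table of per-class image counts
    lines = ["| class | images |", "| --- | --- |"]
    lines += [f"| {cls} | {counts.get(cls, 0)} |" for cls in classes]
    (root / "dataset_stats.md").write_text("\n".join(lines) + "\n")


# Demo in a temporary directory so nothing is left on disk.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "dataset"
    export_layout(root, ["dog", "car"], {"dog": 120, "car": 95})
    created = sorted(p.relative_to(root).as_posix() for p in root.rglob("*"))
    metadata = json.loads((root / "metadata.json").read_text())

print(created)
```

One folder per class under each split matches the layout that common image-folder dataset loaders expect, which is likely why the structure above is organized this way.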
Query Expansion:
Clustering:
Progressive Filtering:
Validation:
Based on original research:
# Deep learning
pip install torch torchvision # or tensorflow
# Clustering
pip install scikit-learn
# Image processing
pip install pillow opencv-python
# N-gram data
# Download Google Books Ngrams: https://storage.googleapis.com/books/ngrams/books/datasetsv3.html