Download datasets, manage competitions and notebooks via Kaggle API
Kaggle is the world's largest data science and machine learning community, hosting thousands of datasets, competitions, and computational notebooks. The Kaggle API provides programmatic access to these resources, enabling researchers to download datasets, submit competition entries, manage kernels (notebooks), and explore the Kaggle ecosystem from the command line or scripts.
For academic researchers, Kaggle is a valuable resource for accessing curated, well-documented datasets across diverse domains including healthcare, natural language processing, computer vision, economics, and social sciences. Many published research papers use Kaggle datasets as benchmarks, and the platform's competition infrastructure provides standardized evaluation frameworks for comparing methods.
The Kaggle API is available as a Python CLI tool and library. It requires a free Kaggle account and API token for authentication. The API supports dataset search and download, competition data retrieval, kernel management, and model access.
A free Kaggle API token is required. Generate one from your Kaggle account settings at https://www.kaggle.com/settings.
Download the kaggle.json credentials file and place it in the standard location:
```bash
# The kaggle.json file should be at ~/.kaggle/kaggle.json
# It contains your username and key from your Kaggle account settings
mkdir -p ~/.kaggle
# Move your downloaded kaggle.json to ~/.kaggle/kaggle.json
chmod 600 ~/.kaggle/kaggle.json
```
Alternatively, use environment variables:
```bash
# Replace the placeholders with the username and key from your kaggle.json
export KAGGLE_USERNAME="your-username"
export KAGGLE_KEY="your-api-key"
```
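Before calling the CLI from a script, it can help to confirm that one of the two credential sources above is actually in place. The following is a minimal sketch: `find_kaggle_credentials` is a hypothetical helper (not part of the `kaggle` package), and checking environment variables before the file is an assumption of this sketch rather than a statement about the client's own lookup order.

```python
import json
import os
import stat
from pathlib import Path


def find_kaggle_credentials(config_dir=None, environ=None):
    """Return (source, username) for the first usable credential source, else None.

    Hypothetical helper for illustration; it is not part of the kaggle package.
    """
    environ = os.environ if environ is None else environ

    # Assumption of this sketch: environment variables are checked first.
    if environ.get("KAGGLE_USERNAME") and environ.get("KAGGLE_KEY"):
        return ("environment", environ["KAGGLE_USERNAME"])

    config_dir = Path(config_dir) if config_dir else Path.home() / ".kaggle"
    token_path = config_dir / "kaggle.json"
    if not token_path.is_file():
        return None

    data = json.loads(token_path.read_text())
    if not data.get("username") or not data.get("key"):
        return None

    # Warn when group/other permission bits are set; chmod 600 clears them.
    mode = stat.S_IMODE(token_path.stat().st_mode)
    if os.name == "posix" and mode & 0o077:
        print(f"Warning: {token_path} is accessible to other users; run chmod 600 on it")
    return ("file", data["username"])
```

Calling `find_kaggle_credentials()` with no arguments inspects the real environment and `~/.kaggle/kaggle.json`; the optional parameters exist only so the check can be exercised against a temporary directory.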
Install the CLI tool:
```bash
pip install kaggle
```
Find datasets by keyword, file type, or license.
```bash
# Search for datasets
kaggle datasets list -s "climate change" --sort-by votes

# Search with specific criteria
kaggle datasets list -s "medical imaging" --file-type csv --max-size 1000000

# Download and unzip a dataset
kaggle datasets download -d "heptapod/titanic" --unzip -p ./data/titanic/

# Download a specific file from a dataset
kaggle datasets download -d "yelp-dataset/yelp-dataset" -f "yelp_academic_dataset_review.json" -p ./data/
```
```bash
# List active competitions
kaggle competitions list

# Download competition data (must accept rules on kaggle.com first)
kaggle competitions download -c "house-prices-advanced-regression-techniques" -p ./data/house-prices/

# Submit predictions
kaggle competitions submit -c "house-prices-advanced-regression-techniques" \
  -f ./submission.csv -m "Random forest baseline v1"

# Check submission status
kaggle competitions submissions -c "house-prices-advanced-regression-techniques"
```
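A malformed file still uses up a submission attempt, so a quick local check before running the submit command above can be worthwhile. The sketch below assumes the competition's downloaded data includes a sample submission file with the expected header and row count (many competitions ship one, but this is not guaranteed); `check_submission` is a hypothetical helper, not part of the Kaggle client.

```python
import csv


def check_submission(submission_path, sample_path):
    """Compare a submission CSV against the competition's sample file.

    Returns a list of problems; an empty list means header and row count match.
    Hypothetical helper for illustration only.
    """
    with open(submission_path, newline="") as f:
        sub_rows = list(csv.reader(f))
    with open(sample_path, newline="") as f:
        sample_rows = list(csv.reader(f))

    if not sub_rows:
        return ["submission file is empty"]
    if not sample_rows:
        return ["sample file is empty"]

    problems = []
    # The header row must list the same columns in the same order.
    if sub_rows[0] != sample_rows[0]:
        problems.append(
            f"header {sub_rows[0]} does not match expected {sample_rows[0]}"
        )
    # One prediction per row in the sample file.
    if len(sub_rows) != len(sample_rows):
        problems.append(
            f"{len(sub_rows) - 1} data rows, expected {len(sample_rows) - 1}"
        )
    return problems
```

This only checks shape, not values; the competition's own scoring remains the real validation.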
```bash
# Search for notebooks
kaggle kernels list -s "transformer nlp" --sort-by voteCount

# Pull a notebook to local
kaggle kernels pull "username/notebook-name" -p ./notebooks/

# Push a notebook to Kaggle
kaggle kernels push -p ./my-notebook/
```
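`kaggle kernels push` reads a `kernel-metadata.json` file from the folder it is given. The sketch below writes one from Python. The field names follow Kaggle's kernel metadata format as best I recall it, so treat them as assumptions and compare against the template that `kaggle kernels init -p <folder>` generates; `write_kernel_metadata` and all argument values are hypothetical.

```python
import json
from pathlib import Path


def write_kernel_metadata(folder, username, slug, title, code_file,
                          dataset_sources=(), kernel_type="notebook",
                          language="python"):
    """Write kernel-metadata.json into `folder` and return its path.

    Hypothetical helper; field names are assumed from Kaggle's metadata format.
    """
    metadata = {
        "id": f"{username}/{slug}",        # owner/slug of the notebook
        "title": title,
        "code_file": code_file,            # path relative to `folder`
        "language": language,
        "kernel_type": kernel_type,        # "notebook" or "script"
        "is_private": True,                # keep private until reviewed
        "enable_gpu": False,
        "enable_internet": False,
        "dataset_sources": list(dataset_sources),
        "competition_sources": [],
        "kernel_sources": [],
    }
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)
    path = folder / "kernel-metadata.json"
    path.write_text(json.dumps(metadata, indent=2))
    return str(path)
```

For example, `write_kernel_metadata("./my-notebook", "username", "notebook-name", "Notebook Name", "analysis.ipynb", dataset_sources=["heptapod/titanic"])` prepares the folder used in the push command above.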
```python
import csv
import io
import os
import subprocess


def search_kaggle_datasets(query, sort_by="votes", max_results=10):
    """Search Kaggle datasets and return structured results."""
    cmd = [
        "kaggle", "datasets", "list",
        "-s", query,
        "--sort-by", sort_by,
        "--max-size", "50000000",
        "--csv",
    ]
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        print(f"Error: {result.stderr}")
        return []
    # Parse with the csv module: dataset titles can contain commas,
    # which a plain split(",") would break into extra fields.
    reader = csv.DictReader(io.StringIO(result.stdout.strip()))
    return list(reader)[:max_results]


def download_dataset(dataset_ref, output_dir="./data"):
    """Download a Kaggle dataset by reference."""
    os.makedirs(output_dir, exist_ok=True)
    cmd = [
        "kaggle", "datasets", "download",
        "-d", dataset_ref,
        "--unzip",
        "-p", output_dir,
    ]
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode == 0:
        print(f"Downloaded {dataset_ref} to {output_dir}")
    else:
        print(f"Error: {result.stderr}")


# Search for NLP benchmark datasets
datasets = search_kaggle_datasets("nlp text classification benchmark")
for ds in datasets[:5]:
    # Column names come from the CLI's CSV header and may vary by version
    print(f"  {ds.get('ref', 'N/A')}")
    print(f"  Size: {ds.get('size', 'N/A')}")
    print(f"  Votes: {ds.get('voteCount', 'N/A')}")
    print()
```
```python
from kaggle.api.kaggle_api_extended import KaggleApi

api = KaggleApi()
api.authenticate()

# Search datasets
datasets = api.dataset_list(search="genomics", sort_by="updated")
for ds in datasets[:5]:
    print(f"{ds.ref}: {ds.title} ({ds.size})")

# Get dataset metadata
metadata = api.dataset_view("nih-chest-xrays/data")
print(f"Title: {metadata.title}")
print(f"Size: {metadata.totalBytes}")
print(f"Description: {metadata.description[:200]}")

# Download dataset files
api.dataset_download_files(
    "nih-chest-xrays/sample",
    path="./data/chest-xrays/",
    unzip=True
)
```
Benchmark Dataset Access: Download well-established datasets used in published research for reproducibility studies. Kaggle hosts canonical versions of many benchmark datasets referenced in ML papers.
Competition as Evaluation Framework: Use Kaggle competitions as standardized evaluation environments with leaderboards and held-out test sets. Submit predictions from novel methods to compare against state-of-the-art approaches.
Data Exploration Notebooks: Search for and pull community notebooks that explore datasets relevant to your research. These often contain valuable preprocessing code, exploratory analysis, and baseline models.
Collaborative Research Datasets: Upload processed research datasets to Kaggle for sharing with collaborators and the broader community, enabling others to reproduce and extend your work.
Cross-Domain Transfer: Search across Kaggle's diverse dataset collection to find datasets from adjacent domains that could be useful for transfer learning or cross-domain validation studies.
Pushing a notebook requires a kernel-metadata.json file specifying the kernel type, language, and datasets.
Never commit kaggle.json to version control; use environment variables in CI/CD pipelines.