hugging-face-evaluation/SKILL.md
Add and manage evaluation results in Hugging Face model cards. Supports extracting eval tables from README content, importing scores from Artificial Analysis API, and running custom model evaluations with vLLM/lighteval. Works with the model-index metadata format.
npx skillsauth add lidge-jun/cli-jaw-skills hugging-face-evaluation
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
All paths are relative to this SKILL.md's directory.
Either cd here or use full paths before running scripts.
Before using --create-pr, check for existing open PRs to avoid duplicating work for maintainers:
uv run scripts/evaluation_manager.py get-prs --repo-id "username/model-name"
If open PRs exist:
Use --help for the latest workflow guidance:
uv run scripts/evaluation_manager.py --help
Standard flow:
1. get-prs → check for existing open PRs
2. inspect-tables → find table numbers/columns
3. extract-readme --table N → prints YAML (preview by default)
4. --apply (push) or --create-pr to write changes

- uv run (PEP 723 header auto-installs deps)
- HF_TOKEN env var with write-access token
- AA_API_KEY env var
- .env is loaded automatically if python-dotenv is installed

Inspect tables (start here):
uv run scripts/evaluation_manager.py inspect-tables --repo-id "username/model-name"
Extract from README:
uv run scripts/evaluation_manager.py extract-readme \
--repo-id "username/model-name" \
--table N \
[--model-column-index N] \
[--model-name-override "Exact Column Header"] \
[--task-type "text-generation"] \
[--dataset-name "Custom Benchmarks"] \
[--apply | --create-pr]
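To make the --table and --model-column-index options concrete, here is a minimal Python sketch of pulling one column of scores out of a markdown table. It is an illustration only, not the implementation in evaluation_manager.py: the function names parse_markdown_table and column_scores, the sample table, and the assumption that benchmarks are rows and models are columns are all hypothetical.

```python
from typing import Dict, List


def parse_markdown_table(lines: List[str]) -> List[List[str]]:
    """Split a pipe-delimited markdown table into rows of cell strings."""
    rows = []
    for line in lines:
        line = line.strip()
        if not line.startswith("|"):
            continue
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        # Skip the header separator row (e.g. |---|:---:|).
        if all(cell and set(cell) <= set("-: ") for cell in cells):
            continue
        rows.append(cells)
    return rows


def column_scores(rows: List[List[str]], model_column_index: int) -> Dict[str, float]:
    """Map each benchmark name (first column) to the score in the chosen column."""
    scores = {}
    for row in rows[1:]:  # rows[0] is the header row
        try:
            scores[row[0]] = float(row[model_column_index])
        except (ValueError, IndexError):
            continue  # non-numeric or missing cell: leave it out
    return scores


# Hypothetical README table, made up for this example.
readme_table = [
    "| Benchmark | Model A | Model B |",
    "|-----------|---------|---------|",
    "| MMLU      | 85.2    | 80.1    |",
    "| GSM8K     | 72.5    | n/a     |",
]
rows = parse_markdown_table(readme_table)
print(rows[0])                # header row, useful for choosing a column index
print(column_scores(rows, 1))
```

Selecting the column by index rather than by header text avoids mismatches when the header contains extra formatting, which is one reason an index-based option can be the safer choice.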
Import from Artificial Analysis:
AA_API_KEY=... uv run scripts/evaluation_manager.py import-aa \
--creator-slug "creator-name" \
--model-name "model-slug" \
--repo-id "username/model-name" \
[--create-pr]
View / Validate:
uv run scripts/evaluation_manager.py show --repo-id "username/model-name"
uv run scripts/evaluation_manager.py validate --repo-id "username/model-name"
Run evaluation job (inference providers):
hf jobs uv run scripts/inspect_eval_uv.py \
--flavor "cpu-basic|t4-small|a10g-small" \
--secrets HF_TOKEN=$HF_TOKEN \
-- --model "model-id" --task "task-name"
Run vLLM evaluation (custom models):
# lighteval with vLLM
uv run scripts/lighteval_vllm_uv.py \
--model meta-llama/Llama-3.2-1B \
--tasks "leaderboard|mmlu|5"
# inspect-ai with vLLM
uv run scripts/inspect_vllm_uv.py \
--model meta-llama/Llama-3.2-1B \
--task mmlu
# Helper script (auto hardware selection)
uv run scripts/run_vllm_eval_job.py \
--model "model-id" \
--task "leaderboard|mmlu|5" \
--framework lighteval
Requires uv installed and sufficient GPU memory. Verify with nvidia-smi.
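The lighteval commands above pass tasks as pipe-delimited strings such as leaderboard|mmlu|5 (suite, task, few-shot count). As a rough sketch, a string in that three-field form could be checked before launching a long job. The LightevalTask type and parse_lighteval_task function are hypothetical helpers, not part of lighteval or the scripts in this skill.

```python
from typing import NamedTuple


class LightevalTask(NamedTuple):
    suite: str
    task: str
    num_fewshot: int


def parse_lighteval_task(spec: str) -> LightevalTask:
    """Parse 'suite|task|num_fewshot' and fail early on malformed input."""
    parts = spec.split("|")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected 'suite|task|num_fewshot', got {spec!r}")
    suite, task, shots = parts
    if not shots.isdecimal():
        raise ValueError(f"num_fewshot must be a non-negative integer, got {shots!r}")
    return LightevalTask(suite, task, int(shots))


parsed = parse_lighteval_task("leaderboard|mmlu|5")
print(parsed.suite, parsed.task, parsed.num_fewshot)
```

Catching a malformed string locally is cheaper than discovering it after a GPU job has been scheduled.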
Two frameworks available:
- lighteval: tasks use the format suite|task|num_fewshot (e.g. leaderboard|mmlu|5)
- inspect-ai: tasks are named directly: mmlu, gsm8k, hellaswag, arc_challenge, truthfulqa, winogrande, humaneval

Key flags:
- --backend accelerate or --backend hf for non-vLLM inference
- --use-chat-template for instruction-tuned models
- --trust-remote-code for models with custom code (e.g. Phi-2, Qwen)
- --tensor-parallel-size N for multi-GPU

Hardware sizing: <3B → t4-small, 3-13B → a10g-small, 13-34B → a10g-large, 34B+ → a100-large
Submit via HF Jobs by prefixing with hf jobs uv run and adding --flavor/--secrets.
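The hardware sizing guidance above can be written as a small lookup for choosing the --flavor value. This is a sketch under stated assumptions: pick_flavor is a hypothetical helper, and because the source ranges overlap at exactly 13B, this version assigns a 13B model to the larger flavor.

```python
def pick_flavor(params_billions: float) -> str:
    """Return the flavor suggested by the sizing guidance for a model size in billions."""
    if params_billions <= 0:
        raise ValueError("parameter count must be positive")
    if params_billions < 3:
        return "t4-small"
    if params_billions < 13:
        return "a10g-small"
    # Assumption: exactly 13B goes to the larger tier (the source is ambiguous here).
    if params_billions < 34:
        return "a10g-large"
    return "a100-large"


for size in (1, 7, 20, 70):
    print(f"{size}B -> {pick_flavor(size)}")
```

Parameter count is only a proxy for memory needs; context length, precision, and batch size can push a model into the next tier.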
model-index:
  - name: Model Name
    results:
      - task:
          type: text-generation
        dataset:
          name: Benchmark Dataset
          type: benchmark_type
        metrics:
          - name: MMLU
            type: mmlu
            value: 85.2
        source:
          name: Source Name
          url: https://source-url.com
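The nesting of the model-index entry above is easy to get wrong by hand, so here is a minimal sketch that assembles the same structure as plain Python data. build_model_index is a hypothetical helper, not a function from evaluation_manager.py or huggingface_hub, and the values are the placeholders from the YAML above. JSON is used for printing only because it needs no extra dependency; writing YAML would need a YAML library.

```python
import json
from typing import Any, Dict, List


def build_model_index(
    model_name: str,
    task_type: str,
    dataset_name: str,
    dataset_type: str,
    metrics: List[Dict[str, Any]],
    source_name: str,
    source_url: str,
) -> Dict[str, Any]:
    """Assemble the nested model-index structure as plain Python data."""
    return {
        "model-index": [
            {
                "name": model_name,
                "results": [
                    {
                        "task": {"type": task_type},
                        "dataset": {"name": dataset_name, "type": dataset_type},
                        "metrics": metrics,
                        "source": {"name": source_name, "url": source_url},
                    }
                ],
            }
        ]
    }


card = build_model_index(
    model_name="Model Name",
    task_type="text-generation",
    dataset_name="Benchmark Dataset",
    dataset_type="benchmark_type",
    metrics=[{"name": "MMLU", "type": "mmlu", "value": 85.2}],
    source_name="Source Name",
    source_url="https://source-url.com",
)
print(json.dumps(card, indent=2))
```

Each entry under results pairs one task and one dataset with a list of metrics, so several benchmarks on the same dataset share a single entry.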
- source.url
- Run get-prs before creating any PR
- Run inspect-tables to see table structure
- --apply or --create-pr
- Prefer --model-column-index over --model-name-override; if using override, match column header text exactly
- --table N
- Use --create-pr when updating models you don't own