skills/hugging-face-community-evals/SKILL.md
Run local evaluations for Hugging Face Hub models with inspect-ai or lighteval.
npx skillsauth add ranbot-ai/awesome-skills hugging-face-community-evals
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
Use this skill for local model evaluation, backend selection, and GPU smoke tests outside the Hugging Face Jobs workflow.
This skill is for running evaluations against models on the Hugging Face Hub on local hardware.
It covers:
- inspect-ai with local inference
- lighteval with local inference
- vllm, Hugging Face Transformers, and accelerate

It does not cover:
- model-index edits
- .eval_results generation or publishing

If the user wants to run the same eval remotely on Hugging Face Jobs, hand off to the hugging-face-jobs skill and pass it one of the local scripts in this skill.
If the user wants to publish results into the community evals workflow, stop after generating the evaluation run and hand off that publishing step to ~/code/community-evals.
All paths below are relative to the directory containing this SKILL.md.
| Use case | Script |
|---|---|
| Local inspect-ai eval on a Hub model via inference providers | scripts/inspect_eval_uv.py |
| Local GPU eval with inspect-ai using vllm or Transformers | scripts/inspect_vllm_uv.py |
| Local GPU eval with lighteval using vllm or accelerate | scripts/lighteval_vllm_uv.py |
| Extra command patterns | examples/USAGE_EXAMPLES.md |
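
The table above can be read as a small lookup from framework and hardware to script. The sketch below is only an illustration of that mapping: the script paths come from the table, while the function name `pick_script` and the `local_gpu` flag are hypothetical and not part of this skill.

```python
# Script paths are copied from the table above.
# The lookup keys and the function itself are illustrative assumptions.
SCRIPTS = {
    ("inspect-ai", False): "scripts/inspect_eval_uv.py",   # provider-backed, no local GPU
    ("inspect-ai", True): "scripts/inspect_vllm_uv.py",    # local GPU, vllm or Transformers
    ("lighteval", True): "scripts/lighteval_vllm_uv.py",   # local GPU, vllm or accelerate
}


def pick_script(framework: str, local_gpu: bool) -> str:
    """Return the script path the table lists for this combination."""
    try:
        return SCRIPTS[(framework, local_gpu)]
    except KeyError:
        raise ValueError(
            f"the table lists no script for framework={framework!r}, local_gpu={local_gpu}"
        ) from None


if __name__ == "__main__":
    print(pick_script("inspect-ai", local_gpu=False))
```

Combinations the table does not list raise `ValueError` instead of guessing a script.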
- uv run for local execution.
- HF_TOKEN for gated/private models.

uv --version
printenv HF_TOKEN >/dev/null
nvidia-smi
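
The three checks above can also be scripted. This is a minimal sketch using only the Python standard library; the function name `preflight` and the result keys are hypothetical. It only checks that the executables are on `PATH` and that `HF_TOKEN` is non-empty, so it will not catch a broken GPU driver the way actually running `nvidia-smi` would.

```python
import os
import shutil


def preflight() -> dict[str, bool]:
    """Report which prerequisites appear to be available (names are illustrative)."""
    return {
        # Is the uv executable on PATH?
        "uv": shutil.which("uv") is not None,
        # An empty HF_TOKEN is treated as missing here, unlike printenv,
        # which succeeds for a variable that is set but empty.
        "hf_token": bool(os.environ.get("HF_TOKEN")),
        # Presence of the binary only; this does not query the GPU.
        "nvidia_smi": shutil.which("nvidia-smi") is not None,
    }


if __name__ == "__main__":
    for name, ok in preflight().items():
        print(f"{name}: {'ok' if ok else 'missing'}")
```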
If nvidia-smi is unavailable, either:
- use scripts/inspect_eval_uv.py for lighter provider-backed evaluation, or
- hand off to the hugging-face-jobs skill if the user wants remote compute.

- Use inspect-ai when you want explicit task control and inspect-native flows.
- Use lighteval when the benchmark is naturally expressed as a lighteval task string, especially leaderboard-style tasks.

- Prefer vllm for throughput on supported architectures.
- Use Transformers (--backend hf) or accelerate as compatibility fallbacks.

To keep a first run small:
- inspect-ai: add --limit 10 or similar.
- lighteval: add --max-samples 10.

- For remote compute, hand off to hugging-face-jobs with the same script + args.

Best when the model is already supported by Hugging Face Inference Providers and you want the lowest local setup overhead.
uv run scripts/inspect_eval_uv.py \
--model meta-llama/Llama-3.2-1B \
--task mmlu \
--limit 20
Use this path when:
inspect-evals

Best when you need to load the Hub model directly, use vllm, or fall back to Transformers for unsupported architectures.
Local GPU:
uv run scripts/inspect_vllm_uv.py \
--model meta-llama/Llama-3.2-1B \
--task gsm8k \
--limit 20
Transformers fallback:
uv run scripts/inspect_vllm_uv.py \
--model microsoft/phi-2 \
--task mmlu \
--backend hf \
--trust-remote-code \
--limit 20
Best when the task is naturally expressed as a lighteval task string, especially Open LLM Leaderboard style benchmarks.
Local GPU:
uv run scripts/lighteval_vllm_uv.py \
--model meta-llama/Llama-3.2-3B-Instruct \
--tasks "leaderboard|mmlu|5,leaderboard|gsm8k|5" \
--max-samples 20 \
--use-chat-template
accelerate fallback:
uv run scripts/lighteval_vllm_uv.py \
--model microsoft/phi-2 \
--tasks "leaderboard|mmlu|5" \
--backend accelerate \
--trust-remote-code \
--max-samples 20
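
The commands in this document differ mainly in two flags: the inspect-ai scripts take `--task` and `--limit`, while the lighteval script takes `--tasks` and `--max-samples`. The sketch below assembles a small smoke-test command from those flags. The helper `smoke_command` is hypothetical; only the script paths and flag names are taken from the examples here, and nothing else about the scripts' interfaces is assumed.

```python
import shlex

# (task flag, sample-cap flag) per script, as used in this document's examples.
FLAGS = {
    "scripts/inspect_eval_uv.py": ("--task", "--limit"),
    "scripts/inspect_vllm_uv.py": ("--task", "--limit"),
    "scripts/lighteval_vllm_uv.py": ("--tasks", "--max-samples"),
}


def smoke_command(script: str, model: str, task: str, samples: int = 10) -> list[str]:
    """Build an argv list for a small run of one of the scripts (illustrative helper)."""
    task_flag, cap_flag = FLAGS[script]
    return ["uv", "run", script, "--model", model, task_flag, task, cap_flag, str(samples)]


if __name__ == "__main__":
    argv = smoke_command("scripts/lighteval_vllm_uv.py", "microsoft/phi-2", "leaderboard|mmlu|5")
    # shlex.join quotes the task string so the | characters are safe to paste into a shell.
    print(shlex.join(argv))
```

Returning an argv list rather than a single string keeps the `|` characters in lighteval task strings away from the shell if the list is later passed to `subprocess.run`.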
This skill intentionally stops at local execution and backend selection.
If the user wants to run the same eval remotely, switch to the hugging-face-jobs skill and pass it one of these scripts plus the chosen arguments.
inspect-ai examples:
- mmlu
- gsm8k
- hellaswag
- arc_challenge
- truthfulqa
- winogrande
- humaneval

lighteval task strings use suite|task|num_fewshot:
- leaderboard|mmlu|5
- leaderboard|gsm8k|5
- leaderboard|arc_challenge|25
- lighteval|hellaswag|0

Multiple lighteval tasks can be comma-separated in --tasks.
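
To make the suite|task|num_fewshot format concrete, here is a small validator that splits a `--tasks` value into its parts. It is a local sketch for checking a string before launching a run, not lighteval's own parser; the names `LightevalTask` and `parse_tasks` are hypothetical, and it only accepts the three-field form described above.

```python
from typing import NamedTuple


class LightevalTask(NamedTuple):
    suite: str
    task: str
    num_fewshot: int


def parse_tasks(spec: str) -> list[LightevalTask]:
    """Split a comma-separated --tasks value into (suite, task, num_fewshot) entries."""
    tasks = []
    for item in spec.split(","):
        parts = item.strip().split("|")
        if len(parts) != 3:
            raise ValueError(f"expected suite|task|num_fewshot, got {item!r}")
        suite, task, shots = parts
        # int() raises ValueError if the few-shot count is not a number.
        tasks.append(LightevalTask(suite, task, int(shots)))
    return tasks


if __name__ == "__main__":
    for entry in parse_tasks("leaderboard|mmlu|5,leaderboard|arc_challenge|25"):
        print(entry.suite, entry.task, entry.num_fewshot)
```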
inspect_vllm_uv.py --backend vllm for fast GPU