lidge-jun/hugging-face-evaluation

Name: hugging-face-evaluation
Author: lidge-jun

hugging-face-evaluation/SKILL.md

npx skillsauth add lidge-jun/cli-jaw-skills hugging-face-evaluation

Hugging Face Evaluation Skill

All paths are relative to this SKILL.md's directory. cd here or use full paths before running scripts.

PR Safety — Check Before Creating

Before using --create-pr, check for existing open PRs to avoid duplicating work for maintainers:

uv run scripts/evaluation_manager.py get-prs --repo-id "username/model-name"

If open PRs exist:

Warn the user and show existing PR URLs
Only proceed if the user explicitly confirms creating another PR
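
The gate described above can be sketched in Python. This is an illustration only, not the logic inside evaluation_manager.py: OpenPR and may_create_pr are hypothetical names, and the PR list is assumed to have been collected already (for example from the get-prs output).

```python
from dataclasses import dataclass


@dataclass
class OpenPR:
    """Hypothetical stand-in for one pull request reported by get-prs."""
    url: str
    is_open: bool


def may_create_pr(prs: list[OpenPR], user_confirmed: bool) -> tuple[bool, list[str]]:
    """Return (allowed, urls_to_show).

    Creation is allowed when no open PR exists, or when the user has
    explicitly confirmed after being shown the existing PR URLs.
    """
    open_urls = [pr.url for pr in prs if pr.is_open]
    if not open_urls:
        return True, []
    # Open PRs exist: the caller should warn, show the URLs, and only
    # proceed on explicit confirmation.
    return user_confirmed, open_urls
```

The second element of the result is the list of URLs to show in the warning, so the caller does not need to filter the PRs again.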

Core Workflow

Use --help for the latest workflow guidance:

uv run scripts/evaluation_manager.py --help

Standard flow:

get-prs → check for existing open PRs
inspect-tables → find table numbers/columns
extract-readme --table N → prints YAML (preview by default)
Add --apply (push) or --create-pr to write changes
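
A small sketch of the preview-first step, assuming the flow is driven from Python. The helper name extract_readme_argv and its mode parameter are hypothetical; only the subcommand and flags come from this document.

```python
def extract_readme_argv(repo_id: str, table: int, mode: str = "preview") -> list[str]:
    """Build the argv for extract-readme.

    mode is 'preview' (print YAML only), 'apply', or 'create-pr'.
    Preview is the default, so nothing is written unless asked for.
    """
    argv = [
        "uv", "run", "scripts/evaluation_manager.py", "extract-readme",
        "--repo-id", repo_id,
        "--table", str(table),
    ]
    if mode == "apply":
        argv.append("--apply")
    elif mode == "create-pr":
        argv.append("--create-pr")
    elif mode != "preview":
        raise ValueError(f"unknown mode: {mode!r}")
    return argv
```

The returned list can be handed to subprocess.run(argv, check=True) from the SKILL.md directory. Running once with the default mode and reading the YAML before a second run with "apply" or "create-pr" mirrors the flow above.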

Prerequisites

Use uv run (PEP 723 header auto-installs deps)
Set HF_TOKEN env var with write-access token
For Artificial Analysis: set AA_API_KEY env var
.env is loaded automatically if python-dotenv is installed
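
A pre-flight check for these prerequisites might look like the sketch below. The function names are hypothetical, and the scripts may handle this differently; load_dotenv is the standard python-dotenv entry point and is only called if the package is importable.

```python
import os


def load_dotenv_if_available() -> bool:
    """Load .env via python-dotenv when installed; return False if it is not."""
    try:
        from dotenv import load_dotenv
    except ImportError:
        return False
    load_dotenv()
    return True


def missing_env(required, environ=None):
    """Return the names in `required` that are unset or empty."""
    environ = os.environ if environ is None else environ
    return [name for name in required if not environ.get(name)]
```

For example, missing_env(["HF_TOKEN", "AA_API_KEY"]) before an import-aa run, or missing_env(["HF_TOKEN"]) for the other commands.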

Commands Reference

Inspect tables (start here):

uv run scripts/evaluation_manager.py inspect-tables --repo-id "username/model-name"

Extract from README:

uv run scripts/evaluation_manager.py extract-readme \
  --repo-id "username/model-name" \
  --table N \
  [--model-column-index N] \
  [--model-name-override "Exact Column Header"] \
  [--task-type "text-generation"] \
  [--dataset-name "Custom Benchmarks"] \
  [--apply | --create-pr]

Import from Artificial Analysis:

AA_API_KEY=... uv run scripts/evaluation_manager.py import-aa \
  --creator-slug "creator-name" \
  --model-name "model-slug" \
  --repo-id "username/model-name" \
  [--create-pr]

View / Validate:

uv run scripts/evaluation_manager.py show --repo-id "username/model-name"
uv run scripts/evaluation_manager.py validate --repo-id "username/model-name"

Run evaluation job (inference providers):

# Pick one --flavor: cpu-basic, t4-small, or a10g-small
hf jobs uv run scripts/inspect_eval_uv.py \
  --flavor cpu-basic \
  --secrets HF_TOKEN=$HF_TOKEN \
  -- --model "model-id" --task "task-name"

Run vLLM evaluation (custom models):

# lighteval with vLLM
uv run scripts/lighteval_vllm_uv.py \
  --model meta-llama/Llama-3.2-1B \
  --tasks "leaderboard|mmlu|5"

# inspect-ai with vLLM
uv run scripts/inspect_vllm_uv.py \
  --model meta-llama/Llama-3.2-1B \
  --task mmlu

# Helper script (auto hardware selection)
uv run scripts/run_vllm_eval_job.py \
  --model "model-id" \
  --task "leaderboard|mmlu|5" \
  --framework lighteval

vLLM Evaluation

Requires uv installed and sufficient GPU memory. Verify with nvidia-smi.

Two frameworks available:

lighteval (Hugging Face's library): task format is suite|task|num_fewshot, e.g. leaderboard|mmlu|5
inspect-ai (UK AISI framework): tasks mmlu, gsm8k, hellaswag, arc_challenge, truthfulqa, winogrande, humaneval
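
To make the lighteval task format concrete, here is a minimal parser for the three-field form given above. LightevalTask and parse_lighteval_task are hypothetical helpers, not part of lighteval, and other spec forms are not handled.

```python
from typing import NamedTuple


class LightevalTask(NamedTuple):
    suite: str
    task: str
    num_fewshot: int


def parse_lighteval_task(spec: str) -> LightevalTask:
    """Split 'suite|task|num_fewshot' into its parts, validating each field."""
    parts = spec.split("|")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected suite|task|num_fewshot, got {spec!r}")
    suite, task, shots = parts
    if not shots.isdigit():
        raise ValueError(f"num_fewshot must be a non-negative integer, got {shots!r}")
    return LightevalTask(suite, task, int(shots))
```

Validating the string before submitting a job catches a malformed spec locally rather than after hardware has been allocated.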

Key flags:

--backend accelerate or --backend hf for non-vLLM inference
--use-chat-template for instruction-tuned models
--trust-remote-code for models with custom code (e.g. Phi-2, Qwen)
--tensor-parallel-size N for multi-GPU

Hardware sizing: <3B → t4-small, 3-13B → a10g-small, 13-34B → a10g-large, 34B+ → a100-large
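
The sizing rule can be written as a lookup. This is a sketch, not the selection code in run_vllm_eval_job.py: pick_flavor is a hypothetical name, and since the ranges above share their endpoints, it assumes a model exactly at 3B, 13B, or 34B goes to the larger tier.

```python
def pick_flavor(params_billion: float) -> str:
    """Map a parameter count (in billions) to an HF Jobs flavor.

    Assumption: boundary values (3, 13, 34) fall into the larger tier.
    """
    if params_billion <= 0:
        raise ValueError("parameter count must be positive")
    if params_billion < 3:
        return "t4-small"
    if params_billion < 13:
        return "a10g-small"
    if params_billion < 34:
        return "a10g-large"
    return "a100-large"
```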

Submit via HF Jobs by prefixing with hf jobs uv run and adding --flavor/--secrets.

Model-Index Format

model-index:
  - name: Model Name
    results:
      - task:
          type: text-generation
        dataset:
          name: Benchmark Dataset
          type: benchmark_type
        metrics:
          - name: MMLU
            type: mmlu
            value: 85.2
        source:
          name: Source Name
          url: https://source-url.com

Use plain text for model name (no markdown formatting)
URLs belong only in source.url
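
The two rules above can be checked mechanically. The sketch below builds the structure as a plain dict and lints it; build_model_index and lint_model_index are hypothetical helpers, and the markdown check is a simple heuristic (asterisks, backticks, square brackets), not a full parser. It is not the validate subcommand.

```python
import re

URL_RE = re.compile(r"https?://", re.IGNORECASE)
MARKDOWN_RE = re.compile(r"[*`\[\]]")  # heuristic: emphasis, code, link syntax


def build_model_index(model_name, task_type, dataset_name, dataset_type,
                      metrics, source_name, source_url):
    """Assemble a model-index mapping shaped like the YAML above."""
    return {
        "model-index": [{
            "name": model_name,
            "results": [{
                "task": {"type": task_type},
                "dataset": {"name": dataset_name, "type": dataset_type},
                "metrics": metrics,
                "source": {"name": source_name, "url": source_url},
            }],
        }]
    }


def lint_model_index(doc) -> list[str]:
    """Return problems: markdown in the model name, URLs outside source.url."""
    problems = []
    for entry in doc["model-index"]:
        name = entry["name"]
        if MARKDOWN_RE.search(name):
            problems.append(f"model name contains markdown: {name!r}")
        if URL_RE.search(name):
            problems.append(f"model name contains a URL: {name!r}")
        for result in entry["results"]:
            fields = [result["dataset"]["name"], result["source"]["name"]]
            fields += [m["name"] for m in result["metrics"]]
            for value in fields:
                if URL_RE.search(str(value)):
                    problems.append(f"URL outside source.url: {value!r}")
    return problems
```

An empty list from lint_model_index means both rules hold for the fields it inspects. Serializing the dict to YAML is left to whatever YAML library the caller already uses.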

Best Practices

Run get-prs before creating any PR
Start with inspect-tables to see table structure
Preview YAML output before using --apply or --create-pr
Prefer --model-column-index over --model-name-override; if using override, match column header text exactly
For multi-table READMEs, specify --table N
Use --create-pr when updating models you don't own
One model per repo — only add the main model's results

References

CLI examples and detailed usage — full Method 1-4 walkthroughs, common patterns, model name matching
Troubleshooting — error messages and solutions

Add and manage evaluation results in Hugging Face model cards. Supports extracting eval tables from README content, importing scores from Artificial Analysis API, and running custom model evaluations with vLLM/lighteval. Works with the model-index metadata format.

4 stars

development

Updated Apr 24, 2026

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

npx skillsauth add lidge-jun/cli-jaw-skills hugging-face-evaluation

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

Trivy: Container and dependency vulnerability scanner. Clean (95%)
Semgrep: Static code analysis for vulnerabilities. Clean (95%)
mcp-scan (Snyk): Model Context Protocol security validation. Clean (95%)
Snyk (dep): Open source security scanning. Skipped (50%)
Socket.dev: Supply chain security analysis. Skipped (50%)
VirusTotal: Multi-engine malware detection. Skipped (50%)
CrowdStrike: Advanced threat intelligence. Skipped (50%)
OSV-Scanner: Open Source Vulnerability database check. Skipped (50%)
OWASP Dep-Check: Skipped (50%)

Last scanned: Apr 24, 2026, 5:43 PM (203.7s, 15 files scanned)

Install manually

# Clone the repo
git clone https://github.com/lidge-jun/cli-jaw-skills.git

# Copy into Claude Code skills folder (global)
cp -r cli-jaw-skills/hugging-face-evaluation ~/.claude/skills/

Claude Code Skills — official skills path docs.

Repository

lidge-jun/cli-jaw-skills

4 stars

Compatible with: Claude Code, OpenAI Codex CLI, ChatGPT