hebrew-llm-eval-suite/SKILL.md
Benchmark and compare LLMs on Hebrew reasoning, comprehension, sentiment, translation, and Israeli cultural knowledge. Wraps the HuggingFace Open Hebrew LLM Leaderboard tasks (HeQ reading comprehension, HebrewSentiment, Hebrew Winograd, NeuLabs-TedTalks translation) plus DictaLM 3.0 benchmark tasks (Summarization, Nikud diacritization, Israeli Trivia) into a reproducible evaluation harness. Runs evals against Claude, GPT, Gemini, AI21 Jamba, DictaLM, Llama, and local HuggingFace models. Produces comparison scorecards in JSON and markdown with per-task breakdowns. Use when choosing an LLM for a Hebrew product, answering procurement questions about Hebrew performance, validating a fine-tuned Hebrew model, or tracking Hebrew regressions after a model upgrade. Do NOT use for Arabic NLP evaluation, speech recognition benchmarking (use ivrit.ai leaderboard for ASR), or general English LLM benchmarks.
npx skillsauth add skills-il/developer-tools hebrew-llm-eval-suiteInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Israeli product teams pick LLMs blind. There is no standardized Hebrew benchmark that a PM can run in an afternoon to compare Claude against GPT against DictaLM against AI21 Jamba on their actual use case. The HuggingFace Open Hebrew LLM Leaderboard exists but is built for base models and few-shot prompts, not for API-hosted chat models. DictaLM publishes benchmark results but only for its own suite. Teams end up guessing, testing informally, or trusting marketing claims. The result is costly model switches after launch, or shipping Hebrew products on models that silently fail on native speakers.
Different benchmarks test different things. Choose the smallest set that covers your actual use case.
| Benchmark | HuggingFace ID | What it tests | When to use |
|-----------|---------------|---------------|-------------|
| HeQ (Hebrew Question Answering) | Etelis/HeQ_v1 (HF mirror); canonical at github.com/NNLP-IL/Hebrew-Question-Answering-Dataset | Reading comprehension, extractive QA on Hebrew Wikipedia and Geektime articles. 30,147 questions | Any product that answers questions over Hebrew text: search, RAG, support, research assistants |
| HebrewSentiment | HebArabNlpProject/HebrewSentiment | Sentiment classification (positive, negative, neutral). 41,305 samples. License CC-BY-4.0 | Social media analysis, review classification, product feedback |
| Hebrew Winograd | Community port of Winograd Schema Challenge (cs.ubc.ca/~vshwartz/resources/winograd_he.jsonl) | Pronoun resolution requiring world knowledge. Reasoning-heavy | Any product that needs nuanced Hebrew understanding |
| NeuLabs-TedTalks (translation) | OPUS NeuLab-TedTalks en-he subset | English to Hebrew and Hebrew to English translation quality | Translation products, multilingual apps |
| HebNLI | HebArabNlpProject/HebNLI | Natural Language Inference in Hebrew | Classification, content moderation, logical reasoning |
| HEBREW-MMLU (general knowledge) | Hebrew-translated MMLU subset, used by the Open Hebrew LLM Leaderboard ecosystem (verify the active HF mirror before use; openai/MMMLU covers 14 languages but Hebrew is not in the official set) | 14-subject general-knowledge accuracy; Hebrew adaptation of Massive Multitask Language Understanding | General-purpose chat/RAG products that need broad world knowledge in Hebrew |
| DictaLM 3.0 Summarization | Dicta benchmark suite (see DictaLM 3.0 technical report) | Abstractive summarization of Hebrew news | Summarization tools, executive briefings |
| DictaLM 3.0 Nikud | Dicta benchmark suite | Adding vowel diacritics to unvocalized Hebrew | Educational tools, TTS preprocessing, religious text tools |
| DictaLM 3.0 Israeli Trivia | Dicta benchmark suite | Knowledge of Israeli culture, geography, history, politics | Consumer products where cultural grounding matters |
Rule of thumb: start with HeQ (comprehension) plus one task that matches your specific product. Adding benchmarks past three rarely changes the decision. For products that need broad world knowledge, add HEBREW-MMLU.
Wrap the benchmarks above in an established eval framework rather than rolling a runner from scratch:
lm-evaluation-harness (EleutherAI): standard for reproducible base-model evals, used by the HuggingFace Open LLM Leaderboard. Hebrew tasks like HeQ, HebrewSentiment, and HebNLI are NOT shipped as native tasks (last checked April 2026); add them as custom YAML tasks pointing at the HF dataset IDs above. Good fit when comparing open-weight models with consistent few-shot prompting.inspect_ai (UK AI Security Institute): opinionated framework with primitives for dataset, Task, Solver, and Scorer, plus multi-turn agent flows and a log viewer. Adopted by Anthropic, DeepMind, and others through 2024-25. Good fit for chat-model evals and graded scoring. Companion repo UKGovernmentBEIS/inspect_evals ships 200+ pre-built evals; Hebrew tasks are not in the default set but the harness is straightforward to extend.Pick lm-evaluation-harness for base-model leaderboard parity, pick inspect_ai for chat-model and agent evals.
A sensible default set for Israeli product teams:
| Provider | Model | Call via |
|----------|-------|----------|
| Anthropic | claude-opus-4-7 (1M context), claude-opus-4-6, claude-sonnet-4-6, claude-haiku-4-5 | Anthropic SDK |
| OpenAI | gpt-5 family | OpenAI SDK |
| Google | gemini-3 family (gemini-3-pro, gemini-3-flash), gemini-2.5-pro | Google GenAI SDK |
| AI21 (Israeli) | jamba-1.6-large, jamba-1.6-mini, jamba-reasoning-3b (open weights), legacy jamba-1.5-large, jamba-1.5-mini | AI21 SDK or Amazon Bedrock |
| Dicta (Israeli, open-weight) | dicta-il/DictaLM-3.0-24B-Base, dicta-il/DictaLM-3.0-Nemotron-12B-Instruct, dicta-il/DictaLM-3.0-1.7B-Thinking-GGUF, dicta-il/DictaLM-3.0-24B-Thinking, plus dicta-il/dictalm2.0-instruct (DictaLM 2.0, 7B Mistral-based) | HuggingFace transformers or vLLM |
| Cohere (multilingual, Hebrew supported) | CohereLabs/aya-expanse-8b, CohereLabs/aya-expanse-32b, CohereLabs/aya-23-8B, CohereLabs/aya-23-35B | HuggingFace transformers or Cohere API |
| Hebrew-finetuned community models | yam-peleg/Hebrew-Mistral-7B, yam-peleg/Hebrew-Gemma-11B-Instruct, yam-peleg/Hebrew-Mixtral-8x22B | HuggingFace transformers |
| Meta (open-weight) | Llama-3.x-70B-Instruct | HuggingFace transformers or hosted |
| Mistral (open-weight) | Mistral-Large-Instruct | HuggingFace transformers or hosted |
AI21 explicitly positions Jamba 1.5 as supporting Hebrew as a core language. DictaLM is the strongest Hebrew-native open-weight option. Cohere's Aya-23 and Aya Expanse list Hebrew among their supported languages. Yam Peleg's Hebrew-* community models are continuously pretrained from Mistral, Gemma, and Mixtral with extended Hebrew tokenizers. Include at least one Hebrew-native model as a baseline, or the comparison tells you nothing about Hebrew-specific performance.
Use scripts/run_eval.py as the runner. It loads benchmarks from HuggingFace, calls the configured model endpoints, and writes results to disk.
pip install datasets transformers anthropic openai google-genai ai21
export ANTHROPIC_API_KEY=...
export OPENAI_API_KEY=...
export GOOGLE_API_KEY=...
export AI21_API_KEY=...
python scripts/run_eval.py --benchmark heq --model claude-sonnet-4-6 --limit 100
python scripts/run_eval.py --suite hebrew-core --models claude-sonnet-4-6,gpt-5,jamba-1.5-large
The harness uses few-shot prompting for base models and chat format for hosted models. Each benchmark has its own prompt template and scorer in scripts/benchmarks/.
Each benchmark has a primary metric:
| Benchmark | Primary metric | Secondary | |-----------|---------------|-----------| | HeQ | F1, Exact Match | Unanswerable accuracy | | HebrewSentiment | Accuracy | Macro-F1 | | Hebrew Winograd | Accuracy | None | | Translation | BLEU, chrF | Human preference | | HebNLI | Accuracy | Macro-F1 | | Summarization | ROUGE-L, BERTScore-HE | Human preference | | Nikud | Word Accuracy | Character Accuracy | | Israeli Trivia | Accuracy | Category breakdown |
Use scripts/score_results.py to compute metrics. It handles HeQ normalization (Hebrew sofit forms, nikud removal, whitespace).
Use scripts/make_scorecard.py to generate a comparison report. Output includes JSON for programmatic use, markdown with a model-vs-benchmark table, per-benchmark winner and gap analysis, and a weighted recommendation.
Example output excerpt (ILLUSTRATIVE PLACEHOLDERS, NOT MEASURED RESULTS):
| Model | HeQ F1 | Sentiment | Winograd | Trans BLEU | Weighted |
|--------------------------------|--------|-----------|----------|------------|----------|
| MODEL_A (placeholder) | XX.X | XX.X | XX.X | XX.X | XX.X |
| MODEL_B (placeholder) | XX.X | XX.X | XX.X | XX.X | XX.X |
| MODEL_C (placeholder) | XX.X | XX.X | XX.X | XX.X | XX.X |
| MODEL_D (placeholder) | XX.X | XX.X | XX.X | XX.X | XX.X |
The numbers above are placeholders for the shape of the scorecard, not real benchmark scores. Run your own evaluation (see Step 4) to fill in real values; actual results depend on prompts, dataset slices, sampling parameters, and the snapshot date of API-hosted models. Always attach the run config to the scorecard.
A single run on a small subset is not a benchmark. Before trusting a scorecard:
scripts/run_eval.py --runs 3 --samples 1000 handles multi-run aggregation.
API-hosted models change silently. Log the exact model version string from each API response where available. For Claude, use the dated model ID (e.g. claude-opus-4-7-20260415) and log the model field returned in the response. For OpenAI, log the model field from the response and the system_fingerprint when available. For Gemini, log the modelVersion from the response metadata. This makes historical scorecards reproducible.
Tokenizer differences materially affect Hebrew evals on cost, latency, and even task accuracy:
Implications for evals:
Report a "fertility table" alongside the scorecard: model, mean tokens per Hebrew word on the same reference paragraph (we use the first 1,000 words of the HeQ test set as a fixed sample). scripts/measure_fertility.py computes this.
HeQ scoring already calls Dicta-compatible normalization. Sentiment, NLI, and translation evals also need normalization or you will see artificial losses:
scripts/score_results.py --normalize hebrew-strict applies all of the above. --normalize hebrew-loose skips sofit and prefix removal for translation evals where they would change meaning.
For graded scoring (summarization, translation, open-ended QA), an LLM judge is convenient but has Hebrew-specific failure modes:
Mitigations:
User says: "We are building a Hebrew news summary feature and need to pick between Claude, GPT, and DictaLM."
Actions:
python scripts/run_eval.py --suite hebrew-summary --models claude-sonnet-4-6,gpt-5,DictaLM-3.0-24B-Base --samples 1000 --runs 3Result: Data-backed model choice with a reproducible scorecard.
User says: "Anthropic just released a new model version. Did Hebrew quality improve or regress?"
Actions:
scripts/diff_scorecards.py prev.json new.jsonResult: Informed upgrade decision instead of blind follow-the-provider.
scripts/run_eval.py -- Main harness. Loads benchmarks from HuggingFace, calls model endpoints, writes raw outputs to disk. Run: python scripts/run_eval.py --helpscripts/score_results.py -- Loads raw outputs and computes per-benchmark metrics (F1, BLEU, accuracy, ROUGE) with Hebrew-specific normalization. Run: python scripts/score_results.py --helpscripts/make_scorecard.py -- Aggregates scores into a JSON and markdown scorecard with weighted recommendation. Run: python scripts/make_scorecard.py --helpreferences/benchmark-catalog.md -- Full catalog of Hebrew LLM benchmarks with HuggingFace IDs, licenses, sample counts, and prompt templates. Consult when adding a new benchmark.references/prompt-templates.md -- Zero-shot, few-shot, and chain-of-thought templates per benchmark, in English and Hebrew. Consult when tuning prompts.No MCP server is required for running evals. Consider pairing with Hebrew data-source MCPs if you need to collect additional real-world test data beyond public benchmarks.
| Source | URL | What to Check | |--------|-----|---------------| | Open Hebrew LLM Leaderboard (live space) | https://huggingface.co/spaces/hebrew-llm-leaderboard/leaderboard | Live rankings, model submissions, current benchmark scores | | Open Hebrew LLM Leaderboard (announcement blog) | https://huggingface.co/blog/leaderboard-hebrew | Leaderboard methodology, benchmark sources | | HeQ dataset (HF mirror) | https://huggingface.co/datasets/Etelis/HeQ_v1 | Dataset card, sample format. Canonical source: github.com/NNLP-IL/Hebrew-Question-Answering-Dataset | | HebrewSentiment dataset | https://huggingface.co/datasets/HebArabNlpProject/HebrewSentiment | License, splits, label definitions | | HebNLI dataset | https://huggingface.co/datasets/HebArabNlpProject/HebNLI | License, splits, premise-hypothesis structure | | DictaLM 3.0 Technical Report | https://dicta.org.il/publications/DictaLM_3_0___Techincal_Report.pdf | Dicta's Hebrew benchmark suite and methodology (note: filename uses "Techincal" not "Technical") | | Dicta organization on HuggingFace | https://huggingface.co/dicta-il | Latest DictaLM 3.0 variants (24B-Base, Nemotron-12B-Instruct, 1.7B-Thinking-GGUF, 24B-Thinking) and DictaBERT models | | Cohere Aya organization | https://huggingface.co/CohereLabs | Aya-23 (8B/35B) and Aya Expanse (8B/32B) multilingual models with Hebrew support | | Yam Peleg Hebrew models | https://huggingface.co/yam-peleg | Hebrew-Mistral-7B, Hebrew-Gemma-11B-Instruct, Hebrew-Mixtral-8x22B community finetunes | | AI21 Jamba model family announcement | https://www.ai21.com/blog/announcing-jamba-model-family/ | Jamba Hebrew support and model specs | | EleutherAI lm-evaluation-harness | https://github.com/EleutherAI/lm-evaluation-harness | Standard base-model eval framework; Hebrew tasks must be added as custom YAMLs | | UK AISI Inspect AI | https://github.com/UKGovernmentBEIS/inspect_ai | Chat-model eval framework with agent and graded-scoring primitives | | Hebrew NLP Resources index | https://github.com/NNLP-IL/Hebrew-Resources | Comprehensive list of Hebrew NLP datasets and tools |
Cause: Too many parallel calls on a free tier or low quota.
Solution: Reduce --parallel in run_eval.py (default 4). For Anthropic and OpenAI, respect their request-rate guidance. Retries with exponential backoff are implemented in the runner.
Cause: Exact match normalization is not applied. Hebrew whitespace, nikud, and sofit variations cause false negatives.
Solution: Use F1 as primary metric. Apply Dicta-compatible normalization via scripts/score_results.py --normalize hebrew.
Cause: BLEU is unreliable on Hebrew due to morphology. Solution: Use chrF alongside BLEU. Rate a sample of the lowest-scoring outputs manually.
tools
Best practices for using browser-use/video-use to edit Hebrew videos end-to-end with Claude Code. Covers the Hebrew-specific deltas to video-use's 12 Hard Rules: SUB_FORCE_STYLE override (Helvetica has no Hebrew glyphs), the python-bidi pre-shape recipe for libass+SRT BiDi failures on macOS, Hebrew filler-word post-pass on Scribe word timestamps, fontsdir= parameter for reliable font discovery, takes_packed.md handling for Hebrew with sofit/nikud/code-switching, and animation slot guidance that defers to hyperframes-best-practices and remotion-best-practices. Use when editing Hebrew talking-head video, podcast clips, tutorials, or marketing video with video-use. Do NOT use for non-Hebrew video-use sessions (read upstream SKILL.md directly), Hebrew podcast audio-only post-production (use hebrew-podcast-postproduction), or generic FFmpeg work without video-use orchestration.
development
Best practices for authoring presentations with open-slide, the React slide framework with a fixed 1920×1080 canvas, with full Hebrew and RTL support. Covers the slides/[id]/index.tsx file contract, type scale, DesignSystem tokens, themes/ system, @slide-comment inspector markers, current.json deictic resolution, Hebrew Google Fonts (Heebo, Rubik, Assistant, Noto Sans Hebrew), CSS logical properties, bidirectional Hebrew+English text with the bdi element, and Hebrew-aware type scale tuning. Use when authoring or editing slides under slides/[id]/ in an open-slide project, or when building Hebrew or bilingual decks on the framework. Do NOT use for video creation (use remotion-best-practices or hyperframes-best-practices), or for generic Hebrew presentations outside open-slide (use presentation-generator).
development
Best practices for programmatic video creation using HyperFrames, plain HTML compositions with GSAP animations rendered to MP4, with full Hebrew and RTL support. Covers composition authoring, data-* timing attributes, GSAP timeline contract, layout-before-animation methodology, visual identity gate, Hebrew fonts via Google Fonts (Heebo, Rubik, Assistant), RTL text rendering with dir="rtl", Hebrew TikTok/Reels-style captions via Whisper, audio-reactive visuals, scene transitions, and bidirectional Hebrew+English text. Use when building HTML-based video content or Hebrew social/marketing videos without React. Do NOT use for Remotion or general React video work, use remotion-best-practices for that.
tools
Build Zapier Zaps connecting Israeli business apps (Morning/Green Invoice, Cardcom, Tranzila, iCount, Grow) with global services for billing, payment, and workflow automation. Use when asked to "create a Zap for Israeli invoicing", "automate Morning receipts", "connect Cardcom to my CRM", or set up payment notifications. Covers Hebrew text handling, ILS formatting, bimonthly VAT logic, Invoice Reform 2026, Zapier AI (Copilot, Agents, MCP), and webhooks from Israeli processors. All amounts use decimal shekels, not agorot. Customer WhatsApp requires Twilio/WATI (not Zapier native). Do NOT use for n8n (use n8n-hebrew-workflows), Make.com (use make-com-israeli-automations), or non-Zapier automation.