.agent/skills/ground-truth-evaluation/SKILL.md
Evaluate and validate ground truths for Oak semantic search using the COMMIT protocol. Use when reviewing existing ground truths, diagnosing low MRR scores, interpreting benchmark results, or running pnpm benchmark.
npx skillsauth add oaknational/oak-open-curriculum-ecosystem ground-truth-evaluationInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Validate existing ground truths to ensure they accurately represent what search SHOULD return.
We know Elasticsearch works. We test whether our search service with our data delivers value to teachers. Don't evaluate ground truths for morphological variation, disambiguation, or other ES features.
When evaluating, don't flag "difficulty mismatch" as an issue. Teachers can search for anything.
ALL search works on metadata. Don't evaluate MFL/PE differently because they lack transcripts — metadata-based search is the foundation for ALL subjects.
Flag queries that include the subject name when already filtered. "French negation" filtered to French is redundant.
ALL ground truths are from the perspective of a PROFESSIONAL TEACHER in the UK searching for curriculum content to teach. When evaluating, always ask: "Would a UK teacher searching for this topic find these lessons useful?"
IMPORTANT: The bulk-downloads/ directory is gitignored. File search tools will NOT see these files.
Use shell commands:
ls bulk-downloads/
jq '.lessons[] | {slug: .lessonSlug, title: .lessonTitle}' bulk-downloads/SUBJECT-PHASE.json
When MRR is low, ask: "Are the expected slugs actually the best matches?"
Session 9 proved this: MRR 0.000 was blamed on "search quality". After investigation, expected slugs used "emotions" but query said "feel". Search correctly prioritised "feelings" lessons. After correction: MRR 0.000 -> 1.000.
True independent discovery means: identify best lessons from curriculum content, COMMIT to rankings, ONLY THEN compare with search.
This is NOT independent discovery:
Even when queries have "similar semantic intent", you MUST do fresh bulk exploration for EACH query. Copying expected slugs is FORBIDDEN.
The current system uses LessonGroundTruth entries:
export const MATHS_SECONDARY: LessonGroundTruth = {
subject: 'maths',
phase: 'secondary',
keyStage: 'ks3',
query: 'dividing fractions using reciprocals',
expectedRelevance: {
'dividing-a-fraction-by-a-fraction': 3,
'dividing-with-decimals': 2,
'checking-and-securing-dividing-a-fraction-by-a-whole-number': 2,
},
description: 'Lesson teaches dividing fractions by fractions using diagrams and the reciprocal method.',
} as const;
| Score | Meaning | |-------|---------| | 3 | Direct match — teaches exactly what query asks | | 2 | Related — covers topic but not directly | | 1 | Tangential — mentions concept peripherally |
Position of first relevant result.
| MRR | Meaning | |-----|---------| | > 0.90 | Excellent - first result almost always relevant | | > 0.70 | Good - relevant result usually in top 2 | | > 0.50 | Fair - relevant result usually in top 3 | | < 0.50 | Poor - users must scroll |
Overall ranking quality across top 10.
| NDCG | Meaning | |------|---------| | > 0.85 | Excellent - near-optimal ranking | | > 0.75 | Good - highly relevant results near top | | > 0.60 | Fair - some ranking issues | | < 0.60 | Poor - significant ranking problems |
Of top 3 results, what proportion are relevant?
| P@3 | Meaning | |-----|---------| | > 0.80 | Excellent - most results relevant | | > 0.60 | Good - majority useful | | > 0.40 | Fair - some noise | | < 0.40 | Poor - too many irrelevant |
Of all relevant results, what proportion found in top 10?
| R@10 | Meaning | |------|---------| | > 0.80 | Excellent - finding almost all | | > 0.60 | Good - finding most | | > 0.40 | Fair - missing some | | < 0.40 | Poor - systematically missing content |
| Pattern | Interpretation | Action | |---------|----------------|--------| | High R@10 + Low MRR | Results found but poorly ranked | Search ranking issue | | Low R@10 | Expected slugs not in results | GT likely wrong | | High MRR + Low NDCG | First result good, rest poor | Ranking tail issue | | Low P@3 | Too much noise in top results | Query too broad |
cd apps/oak-search-cli
pnpm oaksearch search lessons "query" --subject subject --key-stage keyStage
Use MCP tools to verify relevance:
get-lessons-summary: lesson="lesson-slug"
get-units-summary: unit="unit-slug"
For imprecise-input or ranking issues:
source .env.local
curl -s "${ELASTICSEARCH_URL}/oak_lessons/_search" \
-H "Authorization: ApiKey ${ELASTICSEARCH_API_KEY}" \
-H "Content-Type: application/json" \
-d '{
"query": {"bool": {
"must": [{"match": {"lesson_title": {"query": "term", "fuzziness": "AUTO"}}}],
"filter": [{"term": {"subject_slug": "subject"}}]
}},
"size": 5,
"_source": ["lesson_slug", "lesson_title"]
}' | jq '.hits.hits[]._source'
Based on evidence:
pnpm type-check
pnpm test
Symptom: Benchmark shows useful results, but expected slugs not found.
Diagnosis: Expected slugs may not be the best matches.
Fix:
Symptom: Many relevant results, hard to pick expected slugs.
Diagnosis: Query lacks specificity.
Fix: Add distinguishing terms:
Symptom: Low MRR, expected slugs don't contain query terms.
Diagnosis: Query says "verbs" but expected slugs have "avoir" without "verb" in keywords.
Fix: Either redesign query to match content terminology, or find slugs that actually contain query-relevant terms.
WRONG:
CORRECT:
WRONG: Only examine lessons with matching titles.
CORRECT: Review ALL units. MCP summaries reveal content that titles don't suggest.
For design principles, category-specific guidance, and lessons learned from 25+ review sessions:
apps/oak-search-cli/src/lib/search-quality/ground-truth/GROUND-TRUTH-GUIDE.mdapps/oak-search-cli/docs/IR-METRICS.mdtools
When and how to use git worktrees for isolated work.
documentation
TSDoc and documentation workflow for canonical source comments, README updates, and ADR touchpoints.
development
Structured debugging workflow: reproduce, isolate, hypothesise, verify, fix, regression test.
data-ai
Load the shared thorough start-right workflow from `.agent/skills/start-right-thorough/shared/start-right-thorough.md`.