Ground Truth Evaluation

Validate existing ground truths to ensure they accurately represent what search SHOULD return.

Priorities

Minimal working system — Prove the approach with ~33 foundational ground truths
Product integration — Integrate semantic search into useful teacher tools
Expansion — Once proven, plan GT expansions for improved coverage

Core Principles

We Test OUR Value, Not Elasticsearch

We know Elasticsearch works. We test whether our search service with our data delivers value to teachers. Don't evaluate ground truths for morphological variation, disambiguation, or other ES features.

We Enable Teachers, Not Police Them

When evaluating, don't flag "difficulty mismatch" as an issue. Teachers can search for anything.

Metadata Is the Default

ALL search works on metadata. Don't evaluate MFL/PE differently because they lack transcripts — metadata-based search is the foundation for ALL subjects.

No Redundant Subject Terms

Flag queries that include the subject name when already filtered. "French negation" filtered to French is redundant.

Critical: Teacher Persona

ALL ground truths are from the perspective of a PROFESSIONAL TEACHER in the UK searching for curriculum content to teach. When evaluating, always ask: "Would a UK teacher searching for this topic find these lessons useful?"

Bulk Data Access

IMPORTANT: The bulk-downloads/ directory is gitignored. File search tools will NOT see these files.

Use shell commands:

ls bulk-downloads/
jq '.lessons[] | {slug: .lessonSlug, title: .lessonTitle}' bulk-downloads/SUBJECT-PHASE.json

Cardinal Rules

Rule 1: Search Might Be RIGHT. Expected Slugs Might Be WRONG.

When MRR is low, ask: "Are the expected slugs actually the best matches?"

Session 9 proved this: MRR 0.000 was blamed on "search quality". After investigation, expected slugs used "emotions" but query said "feel". Search correctly prioritised "feelings" lessons. After correction: MRR 0.000 -> 1.000.

Rule 2: Form YOUR Assessment BEFORE Seeing Search Results

True independent discovery means: identify best lessons from curriculum content, COMMIT to rankings, ONLY THEN compare with search.

This is NOT independent discovery:

Run benchmark
See what search returned
Justify those results as "good"

Rule 3: Every Query Requires FRESH Analysis

Even when queries have "similar semantic intent", you MUST do fresh bulk exploration for EACH query. Copying expected slugs is FORBIDDEN.

Foundational Ground Truth Structure

The current system uses LessonGroundTruth entries:

export const MATHS_SECONDARY: LessonGroundTruth = {
  subject: 'maths',
  phase: 'secondary',
  keyStage: 'ks3',
  query: 'dividing fractions using reciprocals',
  expectedRelevance: {
    'dividing-a-fraction-by-a-fraction': 3,
    'dividing-with-decimals': 2,
    'checking-and-securing-dividing-a-fraction-by-a-whole-number': 2,
  },
  description: 'Lesson teaches dividing fractions by fractions using diagrams and the reciprocal method.',
} as const;

Relevance Scores

| Score | Meaning | |-------|---------| | 3 | Direct match — teaches exactly what query asks | | 2 | Related — covers topic but not directly | | 1 | Tangential — mentions concept peripherally |

Metrics Reference

MRR (Mean Reciprocal Rank)

Position of first relevant result.

| MRR | Meaning | |-----|---------| | > 0.90 | Excellent - first result almost always relevant | | > 0.70 | Good - relevant result usually in top 2 | | > 0.50 | Fair - relevant result usually in top 3 | | < 0.50 | Poor - users must scroll |

NDCG@10 (Normalized Discounted Cumulative Gain)

Overall ranking quality across top 10.

| NDCG | Meaning | |------|---------| | > 0.85 | Excellent - near-optimal ranking | | > 0.75 | Good - highly relevant results near top | | > 0.60 | Fair - some ranking issues | | < 0.60 | Poor - significant ranking problems |

P@3 (Precision at 3)

Of top 3 results, what proportion are relevant?

| P@3 | Meaning | |-----|---------| | > 0.80 | Excellent - most results relevant | | > 0.60 | Good - majority useful | | > 0.40 | Fair - some noise | | < 0.40 | Poor - too many irrelevant |

R@10 (Recall at 10)

Of all relevant results, what proportion found in top 10?

| R@10 | Meaning | |------|---------| | > 0.80 | Excellent - finding almost all | | > 0.60 | Good - finding most | | > 0.40 | Fair - missing some | | < 0.40 | Poor - systematically missing content |

Diagnostic Patterns

| Pattern | Interpretation | Action | |---------|----------------|--------| | High R@10 + Low MRR | Results found but poorly ranked | Search ranking issue | | Low R@10 | Expected slugs not in results | GT likely wrong | | High MRR + Low NDCG | First result good, rest poor | Ranking tail issue | | Low P@3 | Too much noise in top results | Query too broad |

Quick Review Process

Step 1: Test Query via CLI Search Command

cd apps/oak-search-cli
pnpm oaksearch search lessons "query" --subject subject --key-stage keyStage

Step 2: Explore Curriculum Data

Use MCP tools to verify relevance:

get-lessons-summary: lesson="lesson-slug"
get-units-summary: unit="unit-slug"

Step 3: Verify with Direct ES Queries

For imprecise-input or ranking issues:

source .env.local
curl -s "${ELASTICSEARCH_URL}/oak_lessons/_search" \
  -H "Authorization: ApiKey ${ELASTICSEARCH_API_KEY}" \
  -H "Content-Type: application/json" \
  -d '{
    "query": {"bool": {
      "must": [{"match": {"lesson_title": {"query": "term", "fuzziness": "AUTO"}}}],
      "filter": [{"term": {"subject_slug": "subject"}}]
    }},
    "size": 5,
    "_source": ["lesson_slug", "lesson_title"]
  }' | jq '.hits.hits[]._source'

Step 4: Update Ground Truth

Based on evidence:

Update query if it lacks differentiation power
Update expectedRelevance with qualitatively best matches
Update description to explain what the test proves

Step 5: Validate

pnpm type-check
pnpm test

Troubleshooting

Low MRR Despite Good Results

Symptom: Benchmark shows useful results, but expected slugs not found.

Diagnosis: Expected slugs may not be the best matches.

Fix:

Look at what search actually returns
Use MCP tools to verify if returned results are qualitatively better
Update expectedRelevance if search found genuinely better lessons

Query Too Broad

Symptom: Many relevant results, hard to pick expected slugs.

Diagnosis: Query lacks specificity.

Fix: Add distinguishing terms:

"fractions" -> "adding fractions unlike denominators"
"democracy" -> "democracy voting elections UK"

Query/Slug Term Mismatch

Symptom: Low MRR, expected slugs don't contain query terms.

Diagnosis: Query says "verbs" but expected slugs have "avoir" without "verb" in keywords.

Fix: Either redesign query to match content terminology, or find slugs that actually contain query-relevant terms.

Anti-Patterns

Search Validation (NOT Independent Discovery)

WRONG:

Run benchmark -> see search returns A, B, C
Get MCP summaries for A, B, C
Note they have relevant content
Conclude "A, B, C are good"
COMMIT table matches search exactly

CORRECT:

Search bulk data -> find candidates X, Y, Z, A, B, W... (10+ slugs)
Analyse each against query
Realise X and Y directly match; A and B are tangential
COMMIT: X=#1, Y=#2, W=#3 (BEFORE seeing search)
Run benchmark -> see search returns A, B, C
Three-way comparison shows differences
Conclude: "X and Y are better because..."

Title-Only Discovery

WRONG: Only examine lessons with matching titles.

CORRECT: Review ALL units. MCP summaries reveal content that titles don't suggest.

Additional Resources

For design principles, category-specific guidance, and lessons learned from 25+ review sessions:

apps/oak-search-cli/src/lib/search-quality/ground-truth/GROUND-TRUTH-GUIDE.md
apps/oak-search-cli/docs/IR-METRICS.md

Ground Truth Evaluation

Validate existing ground truths to ensure they accurately represent what search SHOULD return.

Priorities

Minimal working system — Prove the approach with ~33 foundational ground truths
Product integration — Integrate semantic search into useful teacher tools
Expansion — Once proven, plan GT expansions for improved coverage

Core Principles

We Test OUR Value, Not Elasticsearch

We Enable Teachers, Not Police Them

When evaluating, don't flag "difficulty mismatch" as an issue. Teachers can search for anything.

Metadata Is the Default

ALL search works on metadata. Don't evaluate MFL/PE differently because they lack transcripts — metadata-based search is the foundation for ALL subjects.

No Redundant Subject Terms

Flag queries that include the subject name when already filtered. "French negation" filtered to French is redundant.

Critical: Teacher Persona

Bulk Data Access

IMPORTANT: The bulk-downloads/ directory is gitignored. File search tools will NOT see these files.

Use shell commands:

ls bulk-downloads/
jq '.lessons[] | {slug: .lessonSlug, title: .lessonTitle}' bulk-downloads/SUBJECT-PHASE.json

Cardinal Rules

Rule 1: Search Might Be RIGHT. Expected Slugs Might Be WRONG.

When MRR is low, ask: "Are the expected slugs actually the best matches?"

Rule 2: Form YOUR Assessment BEFORE Seeing Search Results

True independent discovery means: identify best lessons from curriculum content, COMMIT to rankings, ONLY THEN compare with search.

This is NOT independent discovery:

Run benchmark
See what search returned
Justify those results as "good"

Rule 3: Every Query Requires FRESH Analysis

Even when queries have "similar semantic intent", you MUST do fresh bulk exploration for EACH query. Copying expected slugs is FORBIDDEN.

Foundational Ground Truth Structure

The current system uses LessonGroundTruth entries:

export const MATHS_SECONDARY: LessonGroundTruth = {
  subject: 'maths',
  phase: 'secondary',
  keyStage: 'ks3',
  query: 'dividing fractions using reciprocals',
  expectedRelevance: {
    'dividing-a-fraction-by-a-fraction': 3,
    'dividing-with-decimals': 2,
    'checking-and-securing-dividing-a-fraction-by-a-whole-number': 2,
  },
  description: 'Lesson teaches dividing fractions by fractions using diagrams and the reciprocal method.',
} as const;

Relevance Scores

| Score | Meaning | |-------|---------| | 3 | Direct match — teaches exactly what query asks | | 2 | Related — covers topic but not directly | | 1 | Tangential — mentions concept peripherally |

Metrics Reference

MRR (Mean Reciprocal Rank)

Position of first relevant result.

NDCG@10 (Normalized Discounted Cumulative Gain)

Overall ranking quality across top 10.

P@3 (Precision at 3)

Of top 3 results, what proportion are relevant?

| P@3 | Meaning | |-----|---------| | > 0.80 | Excellent - most results relevant | | > 0.60 | Good - majority useful | | > 0.40 | Fair - some noise | | < 0.40 | Poor - too many irrelevant |

R@10 (Recall at 10)

Of all relevant results, what proportion found in top 10?

| R@10 | Meaning | |------|---------| | > 0.80 | Excellent - finding almost all | | > 0.60 | Good - finding most | | > 0.40 | Fair - missing some | | < 0.40 | Poor - systematically missing content |

Diagnostic Patterns

Quick Review Process

Step 1: Test Query via CLI Search Command

cd apps/oak-search-cli
pnpm oaksearch search lessons "query" --subject subject --key-stage keyStage

Step 2: Explore Curriculum Data

Use MCP tools to verify relevance:

get-lessons-summary: lesson="lesson-slug"
get-units-summary: unit="unit-slug"

Step 3: Verify with Direct ES Queries

For imprecise-input or ranking issues:

source .env.local
curl -s "${ELASTICSEARCH_URL}/oak_lessons/_search" \
  -H "Authorization: ApiKey ${ELASTICSEARCH_API_KEY}" \
  -H "Content-Type: application/json" \
  -d '{
    "query": {"bool": {
      "must": [{"match": {"lesson_title": {"query": "term", "fuzziness": "AUTO"}}}],
      "filter": [{"term": {"subject_slug": "subject"}}]
    }},
    "size": 5,
    "_source": ["lesson_slug", "lesson_title"]
  }' | jq '.hits.hits[]._source'

Step 4: Update Ground Truth

Based on evidence:

Update query if it lacks differentiation power
Update expectedRelevance with qualitatively best matches
Update description to explain what the test proves

Step 5: Validate

pnpm type-check
pnpm test

Troubleshooting

Low MRR Despite Good Results

Symptom: Benchmark shows useful results, but expected slugs not found.

Diagnosis: Expected slugs may not be the best matches.

Fix:

Look at what search actually returns
Use MCP tools to verify if returned results are qualitatively better
Update expectedRelevance if search found genuinely better lessons

Query Too Broad

Symptom: Many relevant results, hard to pick expected slugs.

Diagnosis: Query lacks specificity.

Fix: Add distinguishing terms:

"fractions" -> "adding fractions unlike denominators"
"democracy" -> "democracy voting elections UK"

Query/Slug Term Mismatch

Symptom: Low MRR, expected slugs don't contain query terms.

Diagnosis: Query says "verbs" but expected slugs have "avoir" without "verb" in keywords.

Fix: Either redesign query to match content terminology, or find slugs that actually contain query-relevant terms.

Anti-Patterns

Search Validation (NOT Independent Discovery)

WRONG:

Run benchmark -> see search returns A, B, C
Get MCP summaries for A, B, C
Note they have relevant content
Conclude "A, B, C are good"
COMMIT table matches search exactly

CORRECT:

Search bulk data -> find candidates X, Y, Z, A, B, W... (10+ slugs)
Analyse each against query
Realise X and Y directly match; A and B are tangential
COMMIT: X=#1, Y=#2, W=#3 (BEFORE seeing search)
Run benchmark -> see search returns A, B, C
Three-way comparison shows differences
Conclude: "X and Y are better because..."

Title-Only Discovery

WRONG: Only examine lessons with matching titles.

CORRECT: Review ALL units. MCP summaries reveal content that titles don't suggest.

Additional Resources

For design principles, category-specific guidance, and lessons learned from 25+ review sessions:

apps/oak-search-cli/src/lib/search-quality/ground-truth/GROUND-TRUTH-GUIDE.md
apps/oak-search-cli/docs/IR-METRICS.md

Adoption

oaknational/ground-truth-evaluation

$ install --global

Security Scan Results

SKILL.md

Ground Truth Evaluation

Priorities

Core Principles

We Test OUR Value, Not Elasticsearch

We Enable Teachers, Not Police Them

Metadata Is the Default

No Redundant Subject Terms

Critical: Teacher Persona

Bulk Data Access

Cardinal Rules

Rule 1: Search Might Be RIGHT. Expected Slugs Might Be WRONG.

Rule 2: Form YOUR Assessment BEFORE Seeing Search Results

Rule 3: Every Query Requires FRESH Analysis

Foundational Ground Truth Structure

Relevance Scores

Metrics Reference

MRR (Mean Reciprocal Rank)

NDCG@10 (Normalized Discounted Cumulative Gain)

P@3 (Precision at 3)

R@10 (Recall at 10)

Diagnostic Patterns

Quick Review Process

Step 1: Test Query via CLI Search Command

Step 2: Explore Curriculum Data

Step 3: Verify with Direct ES Queries

Step 4: Update Ground Truth

Step 5: Validate

Troubleshooting

Low MRR Despite Good Results

Query Too Broad

Query/Slug Term Mismatch

Anti-Patterns

Search Validation (NOT Independent Discovery)

Title-Only Discovery

Additional Resources

Related Skills

oaknational/worktrees

oaknational/tsdoc

oaknational/systematic-debugging

oaknational/start-right-thorough

oaknational/ground-truth-evaluation

$ install --global

Security Scan Results

SKILL.md

Ground Truth Evaluation

Priorities

Core Principles

We Test OUR Value, Not Elasticsearch

We Enable Teachers, Not Police Them

Metadata Is the Default

No Redundant Subject Terms

Critical: Teacher Persona

Bulk Data Access

Cardinal Rules

Rule 1: Search Might Be RIGHT. Expected Slugs Might Be WRONG.

Rule 2: Form YOUR Assessment BEFORE Seeing Search Results

Rule 3: Every Query Requires FRESH Analysis

Foundational Ground Truth Structure

Relevance Scores

Metrics Reference

MRR (Mean Reciprocal Rank)

NDCG@10 (Normalized Discounted Cumulative Gain)

P@3 (Precision at 3)

R@10 (Recall at 10)

Diagnostic Patterns

Quick Review Process

Step 1: Test Query via CLI Search Command

Step 2: Explore Curriculum Data

Step 3: Verify with Direct ES Queries

Step 4: Update Ground Truth

Step 5: Validate

Troubleshooting

Low MRR Despite Good Results

Query Too Broad

Query/Slug Term Mismatch