plugins/autonomous-dev/skills/archived/quality-scoring/SKILL.md
Multi-dimensional data assessment for training quality evaluation including IFD scoring, factuality, and reasoning validation. Use when scoring training data or evaluating dataset quality. TRIGGER when: quality scoring, data assessment, IFD, factuality, training data quality. DO NOT TRIGGER when: code quality, test coverage, documentation, non-data tasks.
npx skillsauth add akaszubski/autonomous-dev quality-scoring
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
Multi-dimensional assessment for training data quality.
Quality assessment, data scoring, multi-dimensional evaluation, IFD scoring, factuality checks, reasoning validation, training data prep
Fast to comprehensive scoring approaches:
| Type | Quality | IFD | Use Case |
|------|---------|-----|----------|
| SFT | ≥8.0 | ≥0.3 | Base training |
| DPO chosen | ≥9.0 | ≥0.5 | High quality only |
| DPO rejected | ≤6.0 | any | Low quality |
| RLVR | ≥9.0 | ≥0.5 | Verified solutions |
| Calibration | ≥8.0 | ≥0.4 | Uncertainty examples |

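
The thresholds in the table above can be expressed as a small lookup plus a predicate. This is only a minimal sketch and is not part of `training_metrics.py`: the names `Threshold`, `THRESHOLDS`, and `meets_threshold` are hypothetical, and the dictionary keys are illustrative spellings of the table's training types.

```python
from typing import NamedTuple, Optional


class Threshold(NamedTuple):
    min_quality: Optional[float]  # lower bound on quality, if any
    max_quality: Optional[float]  # upper bound on quality, if any
    min_ifd: Optional[float]      # lower bound on IFD; None means "any"


# Values copied from the table above; the key names are hypothetical.
THRESHOLDS = {
    "sft": Threshold(8.0, None, 0.3),
    "dpo_chosen": Threshold(9.0, None, 0.5),
    "dpo_rejected": Threshold(None, 6.0, None),
    "rlvr": Threshold(9.0, None, 0.5),
    "calibration": Threshold(8.0, None, 0.4),
}


def meets_threshold(training_type: str, quality: float, ifd: float) -> bool:
    """Return True if a sample satisfies the table row for training_type."""
    t = THRESHOLDS[training_type]
    if t.min_quality is not None and quality < t.min_quality:
        return False
    if t.max_quality is not None and quality > t.max_quality:
        return False
    if t.min_ifd is not None and ifd < t.min_ifd:
        return False
    return True
```

Note that the DPO rejected row is the only one with an upper bound on quality, which is why the sketch keeps separate minimum and maximum fields.
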
| Concept | Details | Reference |
|---------|---------|-----------|
| Scorers | 6 types (FastIFD to Ensemble) | quality-scorers.md |
| Dimensions | 6 metrics (IFD to LLM Quality) | quality-dimensions.md |
| Thresholds | By training type (SFT, DPO, RLVR) | training-thresholds.md |
| Library | training_metrics.py | Integration functions |
from training_metrics import calculate_ifd_score
# IFD = PPL(response | instruction) / PPL(response)
ifd_score = calculate_ifd_score(
    instruction="Explain quantum computing",
    response="Quantum computing uses qubits..."
)
# Higher score = more challenging
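
To make the ratio concrete, the sketch below computes an IFD-style score from per-token log-probabilities. It assumes the definition common in the IFD literature: perplexity of the response conditioned on the instruction, divided by perplexity of the response alone, which is the orientation under which a higher score means the instruction helps less. The function names and the log-probability values are hypothetical, and nothing here shows how `calculate_ifd_score` is actually implemented.

```python
import math
from typing import Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Perplexity = exp of the mean negative log-probability per token."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def ifd_from_logprobs(
    conditioned: Sequence[float],    # log p(token | instruction, prefix)
    unconditioned: Sequence[float],  # log p(token | prefix), no instruction
) -> float:
    """IFD-style ratio: PPL(response | instruction) / PPL(response)."""
    return perplexity(conditioned) / perplexity(unconditioned)


# Hypothetical log-probabilities for a three-token response.
with_instruction = [-0.5, -0.5, -0.5]
without_instruction = [-1.0, -1.0, -1.0]
score = ifd_from_logprobs(with_instruction, without_instruction)
```

Under this assumed definition, a score near 1 means the instruction barely lowers the response's perplexity, while a score well below 1 means the instruction makes the response much easier to predict.
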
from training_metrics import validate_dpo_pairs
# Validate chosen/rejected quality gap
is_valid = validate_dpo_pairs(
    chosen_score=9.2,   # High quality
    rejected_score=5.8  # Low quality
)
# Ensures quality gap ≥0.15
Every DPO pair MUST have multi-dimensional quality scores before training.
This is a hard requirement: a model trained on DPO data without quality scores tends to learn shortcuts (e.g., "longer = better") instead of a genuine preference signal.
Required output fields per pair:
- chosen_score (float): Composite quality score for chosen response
- rejected_score (float): Composite quality score for rejected response
- margin (float): chosen_score - rejected_score (must be ≥3.0)

Length bias audit (MUST run before DPO training):
from pathlib import Path
from training_metrics import validate_dpo_pairs
metrics = validate_dpo_pairs(dpo_path=Path("dpo_pairs.jsonl"))
# Check length bias
longer_chosen = sum(1 for p in metrics.pairs if len(p.chosen) > len(p.rejected))
length_bias = longer_chosen / metrics.total_pairs
if length_bias > 0.70:
    raise ValueError(
        f"DPO length bias {length_bias:.0%} > 70% threshold.\n"
        "Model will learn 'longer = better' shortcut.\n"
        "Fix: Score by quality dimensions, not length."
    )
# Check quality scores present
missing = sum(1 for p in metrics.pairs if p.chosen_score is None)
if missing > 0:
    raise ValueError(f"{missing} pairs missing quality scores; run scoring first")
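
The same audit can be sketched without the library, which may be convenient for unit-testing the thresholds in isolation. The `Pair` dataclass and `audit_pairs` function below are hypothetical stand-ins, not the objects returned by `validate_dpo_pairs`; the 70% length-bias limit and the 3.0 margin are taken from the text above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Pair:
    chosen: str
    rejected: str
    chosen_score: Optional[float] = None
    rejected_score: Optional[float] = None


def audit_pairs(pairs: List[Pair], max_length_bias: float = 0.70,
                min_margin: float = 3.0) -> List[str]:
    """Return a list of human-readable problems; empty means the audit passed."""
    if not pairs:
        return ["no pairs to audit"]
    problems = []

    # Share of pairs where the chosen response is strictly longer.
    longer = sum(1 for p in pairs if len(p.chosen) > len(p.rejected))
    bias = longer / len(pairs)
    if bias > max_length_bias:
        problems.append(f"length bias {bias:.0%} exceeds {max_length_bias:.0%}")

    # Every pair needs both scores and a sufficient margin.
    for i, p in enumerate(pairs):
        if p.chosen_score is None or p.rejected_score is None:
            problems.append(f"pair {i}: missing quality score")
        elif p.chosen_score - p.rejected_score < min_margin:
            problems.append(f"pair {i}: margin below {min_margin}")
    return problems
```

Returning a list of problems instead of raising on the first one is a design choice of this sketch: it lets a caller report every failing pair in a single pass.
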
Scoring workflow:
from training_metrics import assess_rlvr_verifiability
# Assess reasoning trace verifiability
verifiable = assess_rlvr_verifiability(
    reasoning_trace="Step 1: ...\nStep 2: ...",
    domain="math"
)
# Math/coding: 90%+ verifiable required
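
As an illustration of what a verifiability gate might look like, the sketch below checks simple integer arithmetic steps and compares the verified share against 0.9. It assumes the 90% figure refers to the share of reasoning steps that can be checked mechanically, which the source does not define. The toy checker and the names `step_is_verified`, `verifiable_fraction`, and `passes_rlvr_gate` are hypothetical and unrelated to how `assess_rlvr_verifiability` works internally.

```python
import re
from typing import List

# Matches simple steps such as "12 + 30 = 42"; anything else counts as unverifiable.
STEP = re.compile(r"^\s*(-?\d+)\s*([+\-*])\s*(-?\d+)\s*=\s*(-?\d+)\s*$")


def step_is_verified(step: str) -> bool:
    """True only if the step is an integer equation that checks out."""
    m = STEP.match(step)
    if m is None:
        return False
    a, op, b = int(m.group(1)), m.group(2), int(m.group(3))
    claimed = int(m.group(4))
    actual = {"+": a + b, "-": a - b, "*": a * b}[op]
    return actual == claimed


def verifiable_fraction(steps: List[str]) -> float:
    """Fraction of steps the toy checker can confirm."""
    if not steps:
        return 0.0
    return sum(step_is_verified(s) for s in steps) / len(steps)


def passes_rlvr_gate(steps: List[str], threshold: float = 0.9) -> bool:
    """Apply the assumed 90% gate to a list of reasoning steps."""
    return verifiable_fraction(steps) >= threshold
```
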
Detailed guides: See docs/*.md
- docs/quality-scorers.md - 6 scorer implementations
- docs/quality-dimensions.md - 6 dimension definitions
- docs/training-thresholds.md - Thresholds, CLI, distributed performance

import json
from pathlib import Path

def safe_load_data(data_path: str) -> dict:
    """Load data with path validation."""
    # Resolve first so ".." segments and symlinks cannot bypass the prefix check
    path = Path(data_path).resolve()
    # Validate path within allowed directory
    if not str(path).startswith('/allowed/data/'):
        raise ValueError(f"Path outside allowed directory: {path}")
    # Load safely
    return json.loads(path.read_text())
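
A prefix check on the string form of a path is easy to get subtly wrong (for example, by dropping the trailing slash), so a variant built on `Path.is_relative_to` (Python 3.9+) may be preferable. The sketch below is an alternative under that assumption, not the skill's own code; `load_json_within` and its `allowed_root` parameter are hypothetical names.

```python
import json
from pathlib import Path


def load_json_within(data_path: str, allowed_root: str) -> dict:
    """Load JSON only if the resolved path lies inside allowed_root."""
    root = Path(allowed_root).resolve()
    path = Path(data_path).resolve()  # collapses ".." and follows symlinks
    if not path.is_relative_to(root):
        raise ValueError(f"Path outside allowed directory: {path}")
    return json.loads(path.read_text())
```

Passing the root as a parameter keeps the function testable against a temporary directory instead of a hard-coded location.
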
# Score dataset with FastIFD
python -m training_metrics score \
    --input data/train.jsonl \
    --output data/scored.jsonl \
    --scorer fastifd \
    --threshold 0.3

# Multi-dimensional scoring
python -m training_metrics score \
    --input data/train.jsonl \
    --output data/scored.jsonl \
    --scorer multidim \
    --quality-threshold 8.0 \
    --ifd-threshold 0.5

# DPO pair filtering
python -m training_metrics filter_dpo \
    --input data/dpo_pairs.jsonl \
    --output data/filtered_pairs.jsonl \
    --chosen-threshold 9.0 \
    --rejected-threshold 6.0

# RLVR verifiability check
python -m training_metrics assess_rlvr \
    --input data/rlvr_traces.jsonl \
    --output data/verified.jsonl \
    --domain math \
    --threshold 0.9

Primary library: training_metrics.py
Key functions:
- calculate_ifd_score() - IFD calculation
- validate_dpo_pairs() - DPO pair validation
- assess_rlvr_verifiability() - RLVR assessment
- score_quality() - Multi-dimensional scoring
- ensemble_score() - Cross-model ensemble