skills/43-wentorai-research-plugins/skills/domains/education/assessment-design-guide/SKILL.md
Psychometrics and educational assessment design for researchers
A skill for designing, validating, and analyzing educational assessments using modern psychometric methods. Covers classical test theory, item response theory, test construction, validity evidence, and computerized adaptive testing.
Classical test theory (CTT) models observed scores as the sum of a true score and error:
X = T + E
Key reliability coefficients:
| Coefficient | Method | Interpretation |
|-------------|--------|----------------|
| Cronbach's alpha | Internal consistency | Homogeneity of items |
| Test-retest | Stability over time | Temporal consistency |
| Parallel forms | Equivalent test versions | Form equivalence |
| Split-half (Spearman-Brown) | Odd-even item split | Internal consistency |
| Inter-rater (Cohen's kappa) | Multiple raters | Scoring agreement |
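Two of the coefficients in the table can be computed directly from a score matrix. The following is a minimal sketch using only the standard library; the function names and the small data set are hypothetical and chosen for illustration.

```python
from statistics import variance


def cronbach_alpha(item_scores: list[list[float]]) -> float:
    """Cronbach's alpha for a matrix with rows = examinees, columns = items."""
    k = len(item_scores[0])  # number of items
    # Sample variance of each item across examinees
    item_vars = [variance([row[j] for row in item_scores]) for j in range(k)]
    # Sample variance of the total (sum) score
    total_var = variance([sum(row) for row in item_scores])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)


def spearman_brown(r_half: float, length_factor: float = 2.0) -> float:
    """Project reliability when test length is multiplied by `length_factor`."""
    return length_factor * r_half / (1 + (length_factor - 1) * r_half)


# Hypothetical data: 5 examinees x 4 dichotomous items
scores = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
alpha = cronbach_alpha(scores)
# A split-half correlation of 0.6 steps up to 0.75 for the full-length test
full_length_reliability = spearman_brown(0.6)
```

With the default `length_factor=2.0`, `spearman_brown` applies the usual split-half correction; other values give the general prophecy formula for lengthened or shortened tests.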
```python
import numpy as np
import pandas as pd


def item_analysis(responses: pd.DataFrame, total_scores: pd.Series) -> pd.DataFrame:
    """
    Classical item analysis: difficulty, discrimination, point-biserial.
    responses: binary DataFrame (1=correct, 0=incorrect), items as columns.
    total_scores: total test score for each examinee.
    """
    # Cutoffs for the upper and lower 27% groups (the same for every item)
    cutoff_high = total_scores.quantile(0.73)
    cutoff_low = total_scores.quantile(0.27)
    results = []
    for item in responses.columns:
        scores = responses[item]
        difficulty = scores.mean()  # p-value (proportion correct)
        # Point-biserial correlation
        corr = scores.corr(total_scores)
        # Upper-lower discrimination (top/bottom 27%)
        upper = scores[total_scores >= cutoff_high].mean()
        lower = scores[total_scores <= cutoff_low].mean()
        discrimination = upper - lower
        flagged = difficulty < 0.2 or difficulty > 0.9 or discrimination < 0.2
        results.append({
            "item": item,
            "difficulty": round(difficulty, 3),
            "discrimination": round(discrimination, 3),
            "point_biserial": round(corr, 3),
            "flag": "review" if flagged else "ok",
        })
    return pd.DataFrame(results)
```
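The point-biserial above correlates each item with a total score that includes the item itself, which inflates the value on short tests. A common alternative is the corrected item-total (item-rest) correlation. The sketch below is a standard-library illustration with hypothetical function names and data, not part of the original analysis.

```python
from math import sqrt


def pearson(x: list[float], y: list[float]) -> float:
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def item_rest_correlations(responses: list[list[int]]) -> list[float]:
    """Correlate each item with the total score of the other items."""
    totals = [sum(row) for row in responses]
    n_items = len(responses[0])
    result = []
    for j in range(n_items):
        item = [row[j] for row in responses]
        rest = [t - x for t, x in zip(totals, item)]  # total minus this item
        result.append(pearson(item, rest))
    return result


# Hypothetical data: 5 examinees x 4 dichotomous items
responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
totals = [sum(row) for row in responses]
corrected = item_rest_correlations(responses)
uncorrected = [pearson([row[j] for row in responses], totals) for j in range(4)]
```

On this small data set every corrected value is lower than its uncorrected counterpart, which is the pattern the correction is meant to expose.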
IRT provides a more rigorous framework than CTT by modeling the probability of a correct response as a function of ability and item parameters:
```python
import numpy as np


def irt_3pl(theta: float, a: float, b: float, c: float) -> float:
    """
    Three-parameter logistic IRT model.
    theta: examinee ability (typically -3 to +3)
    a: discrimination parameter (slope, typically 0.5 to 2.5)
    b: difficulty parameter (location, same scale as theta)
    c: guessing parameter (lower asymptote, typically 0.0 to 0.35)
    Returns: probability of correct response
    """
    exponent = -a * (theta - b)
    return c + (1 - c) / (1 + np.exp(exponent))


# Item characteristic curves for three items
thetas = np.linspace(-3, 3, 100)
item_easy = [irt_3pl(t, a=1.0, b=-1.0, c=0.2) for t in thetas]
item_medium = [irt_3pl(t, a=1.5, b=0.0, c=0.2) for t in thetas]
item_hard = [irt_3pl(t, a=1.2, b=1.5, c=0.2) for t in thetas]
```
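Item characteristic curves lead directly to item information, which describes how precisely an item measures at each ability level. The sketch below uses the standard 3PL information formula on the same logistic metric as `irt_3pl` (no 1.7 scaling constant). The helper names and the three-item test are hypothetical.

```python
import math


def prob_3pl(theta: float, a: float, b: float, c: float) -> float:
    """3PL probability of a correct response (standard library version)."""
    return c + (1 - c) / (1 + math.exp(-a * (theta - b)))


def info_3pl(theta: float, a: float, b: float, c: float) -> float:
    """Fisher information of a 3PL item: a^2 * (Q/P) * ((P - c) / (1 - c))^2."""
    p = prob_3pl(theta, a, b, c)
    return a**2 * ((1 - p) / p) * ((p - c) / (1 - c)) ** 2


def standard_error(theta: float, items: list[tuple[float, float, float]]) -> float:
    """SE of the ability estimate: 1 / sqrt(sum of item information)."""
    test_info = sum(info_3pl(theta, a, b, c) for a, b, c in items)
    return 1 / math.sqrt(test_info)


# With c = 0 the formula reduces to the 2PL case a^2 * P * Q,
# which peaks at theta = b with value a^2 / 4.
info_no_guessing = info_3pl(0.0, a=1.5, b=0.0, c=0.0)
info_with_guessing = info_3pl(0.0, a=1.5, b=0.0, c=0.2)

# Hypothetical three-item test as (a, b, c) tuples
items = [(1.0, -1.0, 0.2), (1.5, 0.0, 0.2), (1.2, 1.5, 0.2)]
se_at_zero = standard_error(0.0, items)
```

Guessing lowers information at every ability level, so `info_with_guessing` is smaller than `info_no_guessing` for otherwise identical parameters.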
```python
# Using the 'mirt' package in R (called via rpy2 or standalone)
# R code for fitting a 2PL model:
r_code = """
library(mirt)

# responses: binary matrix (examinees x items)
mod <- mirt(responses, model = 1, itemtype = "2PL")

# Item parameters
coef(mod, simplify = TRUE)

# Ability estimates (Expected A Posteriori)
theta_hat <- fscores(mod, method = "EAP")

# Model fit
M2(mod)  # limited-information fit statistic
itemfit(mod, fit_stats = "S_X2")
"""
```
| Model | Parameters | Use Case |
|-------|-----------|----------|
| Rasch (1PL) | b only | Equal discrimination assumed; measurement-focused |
| 2PL | a, b | Different discrimination; general purpose |
| 3PL | a, b, c | Multiple choice with guessing |
| Graded Response | a, b_k | Likert-scale or partial credit items |
| Nominal Response | a_k, c_k | Multiple choice with informative distractors |
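For polytomous items, the graded response model in the table can be illustrated with a short sketch. It assumes Samejima's cumulative-logit formulation with ordered thresholds `b_k`. The function name and parameter values are hypothetical.

```python
import math


def grm_category_probs(theta: float, a: float, thresholds: list[float]) -> list[float]:
    """Category probabilities for a graded response item.

    thresholds: ascending b_k values; K - 1 thresholds define K categories.
    """
    # Cumulative probabilities P(X >= k), bracketed by 1 (lowest) and 0 (beyond highest)
    cumulative = (
        [1.0]
        + [1 / (1 + math.exp(-a * (theta - b))) for b in thresholds]
        + [0.0]
    )
    # Each category probability is the difference of adjacent cumulative curves
    return [cumulative[k] - cumulative[k + 1] for k in range(len(thresholds) + 1)]


# Hypothetical 4-category Likert item with symmetric thresholds
probs = grm_category_probs(theta=0.0, a=1.0, thresholds=[-1.0, 0.0, 1.0])
```

The probabilities sum to 1 by construction, and they stay non-negative only if the thresholds are in ascending order.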
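Following the Standards for Educational and Psychological Testing (AERA/APA/NCME, 2014), validity is a unitary concept supported by five sources of evidence: test content, response processes, internal structure, relations to other variables, and consequences of testing. Evidence for internal structure is commonly examined with factor analysis: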
```python
import pandas as pd
from factor_analyzer import FactorAnalyzer

# item_responses (examinees x items) and item_names (item labels)
# are assumed to be defined earlier.

# Exploratory approach: check dimensionality
fa = FactorAnalyzer(n_factors=3, rotation="promax")
fa.fit(item_responses)

# Eigenvalues for scree plot
eigenvalues, _ = fa.get_eigenvalues()
print("Eigenvalues:", eigenvalues[:10])

# Factor loadings
loadings = pd.DataFrame(
    fa.loadings_,
    columns=["Factor1", "Factor2", "Factor3"],
    index=item_names,
)
print(loadings.round(3))
```
Computerized adaptive testing selects items in real time to match examinee ability:
```
Initialize: theta_0 = 0 (prior mean)
For each item i = 1, 2, ..., until stopping rule met:
    1. Select item with maximum Fisher information at current theta
    2. Administer item, observe response
    3. Update theta estimate using maximum likelihood or Bayesian EAP
    4. Check stopping rule:
       - Fixed length (e.g., 30 items)
       - SE(theta) < threshold (e.g., 0.30)
       - Maximum time reached
Return: final theta estimate and standard error
```
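The loop above can be sketched as a small simulation. This version makes several assumptions that are not in the original: a 2PL item bank with made-up parameters, EAP scoring on a fixed quadrature grid with a standard normal prior, and simulated responses from a known true ability. The time-based stopping rule is omitted.

```python
import math
import random

GRID = [-4 + 0.1 * i for i in range(81)]  # quadrature points for theta


def p_2pl(theta, a, b):
    """2PL probability of a correct response."""
    return 1 / (1 + math.exp(-a * (theta - b)))


def information(theta, a, b):
    """Fisher information of a 2PL item at theta."""
    p = p_2pl(theta, a, b)
    return a**2 * p * (1 - p)


def eap(administered):
    """EAP estimate and posterior SD from ((a, b), score) pairs."""
    weights = []
    for t in GRID:
        w = math.exp(-0.5 * t * t)  # unnormalized N(0, 1) prior
        for (a, b), x in administered:
            p = p_2pl(t, a, b)
            w *= p if x == 1 else 1 - p
        weights.append(w)
    total = sum(weights)
    mean = sum(t * w for t, w in zip(GRID, weights)) / total
    var = sum((t - mean) ** 2 * w for t, w in zip(GRID, weights)) / total
    return mean, math.sqrt(var)


def run_cat(bank, true_theta, se_target=0.30, max_items=30, seed=0):
    """Simulate one adaptive test; returns (theta_hat, se, items_used)."""
    rng = random.Random(seed)
    remaining = list(bank)
    administered = []
    theta, se = 0.0, 1.0  # start at the prior mean and prior SD
    while remaining and len(administered) < max_items and se >= se_target:
        # 1. Select the unused item with maximum information at current theta
        item = max(remaining, key=lambda ab: information(theta, *ab))
        remaining.remove(item)
        # 2. Simulate the examinee's response from the true ability
        x = 1 if rng.random() < p_2pl(true_theta, *item) else 0
        administered.append((item, x))
        # 3. Update the ability estimate (EAP) and its standard error
        theta, se = eap(administered)
    # 4. The loop condition enforces the fixed-length and SE stopping rules
    return theta, se, len(administered)


# Hypothetical bank of 51 items as (a, b) pairs
bank = [(1.0 + 0.25 * (i % 5), -2.5 + 0.1 * i) for i in range(51)]
theta_hat, se_hat, n_used = run_cat(bank, true_theta=1.0)
```

EAP is used here instead of maximum likelihood because it stays finite when the first responses are all correct or all incorrect, a situation where the likelihood has no maximum.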
Item exposure control prevents overuse of high-quality items and maintains test security.
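One widely used exposure control technique is randomesque selection, which picks at random among the `k` most informative items instead of always taking the single best one. It is shown here only as an illustration and is not necessarily the method the original guide had in mind. The function names, bank, and `k=5` are hypothetical.

```python
import math
import random


def information_2pl(theta, a, b):
    """Fisher information of a 2PL item at theta."""
    p = 1 / (1 + math.exp(-a * (theta - b)))
    return a**2 * p * (1 - p)


def randomesque_select(theta, items, k=5, rng=None):
    """Choose uniformly at random among the k most informative items."""
    rng = rng or random.Random()
    ranked = sorted(items, key=lambda ab: information_2pl(theta, *ab), reverse=True)
    return rng.choice(ranked[:k])


# Hypothetical bank: equal discrimination, difficulties from -1.0 to 1.0
bank = [(1.0, -1.0 + 0.2 * i) for i in range(11)]

# Exposure of the first item chosen for 1000 simulated examinees at theta = 0
rng = random.Random(42)
counts = {}
for _ in range(1000):
    item = randomesque_select(0.0, bank, k=5, rng=rng)
    counts[item] = counts.get(item, 0) + 1
max_exposure = max(counts.values()) / 1000
```

Pure maximum-information selection would give the same first item to every examinee starting at theta = 0. Randomesque selection spreads that exposure across the top five items at a small cost in measurement efficiency.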