.claude/skills/prediction-tracking/SKILL.md
Track and evaluate AI predictions over time to assess accuracy. Use when reviewing past predictions to determine if they came true, failed, or remain uncertain.
Install this skill globally with one command: `npx skillsauth add rickoslyder/HypeDelta prediction-tracking`
Works with Claude Code, Cursor, and Windsurf.
Track predictions made by AI researchers and critics, and evaluate their accuracy over time.
When recording a new prediction, capture:
When evaluating predictions, assign one of:
- `verified`: Clearly came true as stated.
- `falsified`: Clearly did not come true.
- `partially-verified`: Partially accurate.
- `too-early`: Not enough time has passed.
- `unfalsifiable`: Cannot be objectively assessed.
- `ambiguous`: Prediction was too vague to evaluate.
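The status values above form a closed set, so it can help to reject anything else before it is stored. A minimal Python sketch, assuming statuses arrive as free-form strings; the names `EvaluationStatus` and `parse_status` are hypothetical and not part of the skill itself:

```python
from enum import Enum


class EvaluationStatus(str, Enum):
    """The six evaluation statuses listed in this skill."""
    VERIFIED = "verified"
    FALSIFIED = "falsified"
    PARTIALLY_VERIFIED = "partially-verified"
    TOO_EARLY = "too-early"
    UNFALSIFIABLE = "unfalsifiable"
    AMBIGUOUS = "ambiguous"


def parse_status(raw: str) -> EvaluationStatus:
    """Trim and lowercase a raw string, then map it to a known status.

    Raises ValueError for anything outside the six statuses.
    """
    return EvaluationStatus(raw.strip().lower())
```

Because the enum subclasses `str`, its members serialize to the same strings used in the JSON output formats.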
For each prediction being evaluated:
What exactly was claimed?
Has enough time passed to evaluate?
What has happened since?
Which evaluation status applies?
If verifiable, rate 0.0-1.0:
What does this tell us about:
For evaluation:

    {
      "evaluations": [
        {
          "predictionId": "id",
          "status": "verified",
          "accuracyScore": 0.85,
          "evidence": "Description of evidence",
          "notes": "Additional context",
          "evaluatedAt": "timestamp"
        }
      ]
    }
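A record in the evaluation format can be assembled and checked before it is written out. The sketch below is illustrative only: `make_evaluation` is a hypothetical helper, the ISO 8601 timestamp is an assumption (the format only says "timestamp"), and allowing `accuracyScore` to be `None` for predictions that cannot be scored is also an assumption.

```python
from datetime import datetime, timezone

VALID_STATUSES = {
    "verified", "falsified", "partially-verified",
    "too-early", "unfalsifiable", "ambiguous",
}


def make_evaluation(prediction_id, status, evidence,
                    accuracy_score=None, notes="", evaluated_at=None):
    """Build one entry for the "evaluations" array (hypothetical helper)."""
    if status not in VALID_STATUSES:
        raise ValueError(f"unknown status: {status!r}")
    # The skill rates accuracy on a 0.0-1.0 scale when a prediction is verifiable.
    if accuracy_score is not None and not 0.0 <= accuracy_score <= 1.0:
        raise ValueError("accuracy_score must be between 0.0 and 1.0")
    # Assumption: timestamps are stored as ISO 8601 strings in UTC.
    when = evaluated_at or datetime.now(timezone.utc)
    return {
        "predictionId": prediction_id,
        "status": status,
        "accuracyScore": accuracy_score,
        "evidence": evidence,
        "notes": notes,
        "evaluatedAt": when.isoformat(),
    }
```

Wrapping a list of such records as `{"evaluations": [...]}` and passing it to `json.dumps` yields the shape shown in the format.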
For accuracy statistics:

    {
      "author": "Author name",
      "totalPredictions": 15,
      "verified": 5,
      "falsified": 3,
      "partiallyVerified": 2,
      "pending": 4,
      "unfalsifiable": 1,
      "averageAccuracy": 0.62,
      "topicBreakdown": {
        "reasoning": { "predictions": 5, "accuracy": 0.7 },
        "agents": { "predictions": 3, "accuracy": 0.4 }
      },
      "calibration": "Assessment of how well-calibrated they are"
    }
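The per-author counters in the statistics format can be derived from a list of evaluation records. This is a rough sketch under stated assumptions: `summarize` is a hypothetical name, `too-early` is assumed to map to `pending`, `ambiguous` has no counter in the format and so only contributes to `totalPredictions`, and `averageAccuracy` is assumed to be the mean over records that carry a score. `topicBreakdown` and `calibration` are left out because the evaluation records shown do not include a topic or a stated confidence.

```python
from collections import Counter

# Assumed mapping from evaluation status to the counter keys in the format.
STATUS_TO_KEY = {
    "verified": "verified",
    "falsified": "falsified",
    "partially-verified": "partiallyVerified",
    "too-early": "pending",
    "unfalsifiable": "unfalsifiable",
}


def summarize(author, evaluations):
    """Aggregate one author's evaluation records into summary counters."""
    counts = Counter()
    scores = []
    for ev in evaluations:
        key = STATUS_TO_KEY.get(ev["status"])
        if key is not None:
            counts[key] += 1
        if ev.get("accuracyScore") is not None:
            scores.append(ev["accuracyScore"])
    stats = {"author": author, "totalPredictions": len(evaluations)}
    for key in STATUS_TO_KEY.values():
        stats[key] = counts[key]
    # Average only over predictions that were actually scored.
    stats["averageAccuracy"] = round(sum(scores) / len(scores), 2) if scores else None
    return stats
```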
Evaluate whether predictors are well-calibrated:
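One way to put a number on calibration is to compare stated confidence with outcomes. The sketch below is not part of the skill: it assumes each resolved prediction comes with a stated probability, which the formats in this document do not record, and the function names are hypothetical. It computes the Brier score (mean squared error between confidence and outcome) and a simple overconfidence gap.

```python
def brier_score(forecasts):
    """Mean squared error between stated confidence and outcome.

    forecasts: list of (confidence, came_true) pairs, confidence in [0, 1].
    Lower is better; 0.0 is a perfect score.
    """
    if not forecasts:
        raise ValueError("need at least one resolved forecast")
    total = sum((p - (1.0 if hit else 0.0)) ** 2 for p, hit in forecasts)
    return total / len(forecasts)


def calibration_gap(forecasts):
    """Mean stated confidence minus observed hit rate.

    Positive values suggest overconfidence, negative values underconfidence.
    """
    if not forecasts:
        raise ValueError("need at least one resolved forecast")
    mean_confidence = sum(p for p, _ in forecasts) / len(forecasts)
    hit_rate = sum(1 for _, hit in forecasts if hit) / len(forecasts)
    return mean_confidence - hit_rate
```

Only resolved predictions (those that clearly came true or clearly did not) fit this kind of scoring; `too-early`, `unfalsifiable`, and `ambiguous` items would need to be excluded first.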
Keep running assessments of key voices:
| Predictor | Total | Accuracy | Calibration | Notes |
|-----------|-------|----------|-------------|-------|
| Sam Altman | 20 | 55% | Overconfident | Timeline optimism |
| Gary Marcus | 15 | 70% | Well-calibrated | Conservative |
| Dario Amodei | 12 | 65% | Slightly over | Safety-focused |
Watch for prediction patterns that suggest bias: