Skills/knowledge/autoresearch/SKILL.md
Optimize any AI in SharePoint skill by iteratively running it against test inputs, scoring outputs with binary evals, mutating the prompt to fix failures, and keeping improvements. Adapts Karpathy's autoresearch methodology to AI in SharePoint's multi-turn conversation architecture.
npx skillsauth add zrosenfield/sharepoint-ai-skills autoresearch
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
This skill adapts autonomous experimentation loops (Karpathy-style autoresearch) to AI in SharePoint. Rather than rewriting a skill from scratch, it runs the skill repeatedly and scores every output to improve it. Instead of spawning sub-agents on a local machine, we work within AI in SharePoint's multi-turn conversation using load_skill and create_skill as our read/write primitives.
Take any existing AI in SharePoint skill, define what "good output" looks like as binary yes/no checks, then run an iterative loop that:
Output: An improved skill (saved via create_skill) + a structured results log in conversation + a changelog of every mutation attempted.
STOP. Do not run any experiments until all fields below are confirmed with the user. Ask for any missing fields before proceeding.
Target skill — Which skill do you want to optimize? (need the exact skill name as used in load_skill)
Test inputs — What 3–5 different prompts/scenarios should we test the skill with? These must include:
If the user only provides happy-path inputs, push back: "What's the hardest input this skill has to handle?"
Eval criteria — What 3–6 binary yes/no checks define a good output? (see references/eval-guide.md for how to write good evals)
Runs per experiment — How many times should we apply the skill per mutation? Default: 3. (In AI in SharePoint, each run is a conversation turn, so 3 balances signal vs speed.)
Budget cap — Optional. Max number of experiment cycles. Default: no cap (runs until user stops or score ceiling hit).
Before changing anything, load and understand the target skill completely.
Use load_skill(skillName="<target>") to read the full SKILL.md.
If the skill has files under references/, load those too with load_skill(skillName="<target>", filePath="references/<file>").
Do NOT skip this. You need to understand what the skill does before you can improve it.
Save the original skill text in your working memory. You'll need it as the baseline to revert to if mutations fail.
Convert the user's eval criteria into structured tests. Every check must be binary — pass or fail, no scales.
Format each eval as:
EVAL [N]: [Short name]
Question: [Yes/no question about the output]
Pass: [What "yes" looks like — be specific]
Fail: [What triggers "no"]
Rules for good evals:
Load references/eval-guide.md for detailed examples and the 3-question test.
Max score calculation:
max_score = [number of evals] × [runs per experiment]
Example: 4 evals × 3 runs = max score of 12.
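As a minimal sketch, the eval template and the max-score formula above can be written down as a small data structure. The `Eval` class, its field names, and the sample questions are hypothetical illustrations, not part of the skill; only the formula itself comes from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Eval:
    """One binary check, mirroring the EVAL template (hypothetical structure)."""
    name: str
    question: str   # yes/no question about the output
    pass_when: str  # what "yes" looks like
    fail_when: str  # what triggers "no"


def max_score(evals: list[Eval], runs_per_experiment: int) -> int:
    """max_score = number of evals x runs per experiment."""
    if not evals or runs_per_experiment < 1:
        raise ValueError("need at least one eval and at least one run")
    return len(evals) * runs_per_experiment


# Sample suite; the questions are invented for illustration only.
suite = [
    Eval("Relevance", "Do all results match the query?", "every result is on topic", "any off-topic result"),
    Eval("URLs", "Does every result include a URL?", "each result has a link", "a result without a link"),
    Eval("Error handling", "Are vague queries handled explicitly?", "asks a clarifying question", "guesses silently"),
    Eval("Brevity", "Is the answer short?", "fits the agreed length", "runs past the agreed length"),
]

print(max_score(suite, 3))  # 4 evals x 3 runs
```

Keeping pass and fail conditions as separate fields makes it harder to slip a scale ("mostly good") into a check that is supposed to be binary.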
Present the eval suite to the user for confirmation before proceeding.
Run the skill AS-IS before changing anything. This is Experiment #0.
summarize-email-v2, search-optimized)
CRITICAL: Score honestly. You are both the generator and the scorer. Default to FAIL on ambiguity. Require explicit evidence for every PASS. See "Mitigating Self-Scoring Bias" below.
After establishing baseline, confirm the score with the user. If baseline is already 90%+, the skill may not need optimization — ask whether to continue.
Scoring is the most critical step. Because AI in SharePoint doesn't have separate scoring agents, you must be rigorous about self-scoring.
For each output, score every eval:
TEST INPUT: [the input used]
EVAL 1: [name] → PASS | FAIL
Evidence: [specific quote or observation from the output]
Reason: [one sentence explaining the verdict]
EVAL 2: [name] → PASS | FAIL
Evidence: [specific quote or observation]
Reason: [one sentence]
...
SCORE: [passed]/[total evals]
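The scoring record above could be tallied along these lines. This is a sketch under assumptions: the `Verdict` class and `score_output` function are hypothetical names, and treating a PASS with empty evidence as a FAIL is one way to encode the "require explicit evidence for every PASS" rule.

```python
from dataclasses import dataclass


@dataclass
class Verdict:
    """One scored eval for one output (hypothetical structure)."""
    eval_name: str
    passed: bool
    evidence: str  # specific quote or observation from the output
    reason: str    # one sentence explaining the verdict


def score_output(verdicts: list[Verdict]) -> tuple[int, int]:
    """Return (passed, total). A PASS with no evidence is counted as a FAIL."""
    passed = sum(1 for v in verdicts if v.passed and v.evidence.strip())
    return passed, len(verdicts)


verdicts = [
    Verdict("Relevance", True, "All three results mention 'Q3 budget'.", "Results match the query."),
    Verdict("URLs", True, "", "Looked fine."),  # no evidence, so it does not count
    Verdict("Brevity", False, "Response runs to nine paragraphs.", "Too long."),
]

passed, total = score_output(verdicts)
print(f"SCORE: {passed}/{total}")
```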
Since you generated the output AND are scoring it, follow these rules strictly:
This is the core autoresearch loop.
For each experiment:
Look at which evals are failing most. Identify the pattern:
Pick ONE thing to change. Don't change 5 things at once — you won't know what helped.
Good mutations:
Bad mutations:
State clearly what you changed and why:
MUTATION: [what changed — one sentence]
HYPOTHESIS: [why this should help — one sentence]
LOCATION: [which part of the skill was modified]
Apply the mutated skill's instructions to the same test inputs. Generate new outputs.
Score every output using the same eval suite and scoring protocol.
Minimum delta for keeping: The improvement must be at least 2 points OR 10% relative improvement (whichever is smaller) to count as real. A 1-point gain on 3 runs is likely noise.
Regression check: Before keeping, compare per-eval scores against the previous best. If ANY individual eval dropped by 2+ points compared to the previous kept version, flag it as a regression — even if the total score improved.
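The two checks above can be sketched as plain functions. The function names are hypothetical, and the sketch assumes "10% relative improvement" is measured against the previous best total; integer arithmetic is used so the 10% comparison does not depend on floating-point rounding.

```python
def meets_min_delta(new_total: int, previous_best: int) -> bool:
    """True if the gain is at least 2 points OR at least 10% of the previous best."""
    delta = new_total - previous_best
    if delta <= 0:
        return False
    # delta * 10 >= previous_best is the same as delta >= 10% of previous_best.
    return delta >= 2 or delta * 10 >= previous_best


def find_regressions(new_per_eval: dict[str, int],
                     prev_per_eval: dict[str, int],
                     drop: int = 2) -> list[str]:
    """Names of evals that fell by `drop` or more points versus the last kept version."""
    return [
        name
        for name, prev in prev_per_eval.items()
        if prev - new_per_eval.get(name, 0) >= drop
    ]


previous = {"Relevance": 3, "URLs": 2, "Error handling": 3, "Brevity": 2}
candidate = {"Relevance": 3, "URLs": 3, "Error handling": 1, "Brevity": 3}

print(meets_min_delta(sum(candidate.values()), sum(previous.values())))
print(find_regressions(candidate, previous))
```

In this made-up candidate the total is unchanged, so the delta check fails, and the regression check separately flags "Error handling" for dropping two points.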
Decision rules:
Record in the structured format (see "Results Format" below).
If you have 3 consecutive discards targeting the same eval, stop. Either:
Go back to step 4a.
Continue until:
Apply these when standard mutations (add instruction, add anti-pattern) aren't working.
Problem: You need content to appear in the output, but it keeps getting truncated or skipped.
What doesn't work: Reordering sections. LLMs follow natural narrative flow, not arbitrary section ordering.
What works: Embed the high-priority content INTO a section the agent naturally generates first. Merge "Key Recommendation" into "Executive Summary." The agent always generates the opening; if critical content lives there, it can't be truncated.
When output gets cut off:
When the skill's output is vague or inconsistent:
After each experiment, present results in this structure:
═══════════════════════════════════════════
EXPERIMENT [N] — [KEEP / DISCARD]
═══════════════════════════════════════════
Mutation: [what was changed]
Hypothesis: [why this was expected to help]
Per-eval scores:
EVAL 1 [name]: [X]/[runs] (prev: [Y]/[runs]) [↑/↓/=]
EVAL 2 [name]: [X]/[runs] (prev: [Y]/[runs]) [↑/↓/=]
...
Total: [X]/[max] ([percent]%) | Previous best: [Y]/[max] ([percent]%)
Delta: [+/-N] points
Regressions: [list any evals that dropped, or "none"]
Decision: [KEEP/DISCARD] — [one sentence reason]
───────────────────────────────────────────
After each experiment, also maintain a running summary table:
AUTORESEARCH PROGRESS — [skill name]
═══════════════════════════════════════════
Exp Score Rate Status Change
─── ───── ──── ────── ──────
0 8/12 67% baseline (original)
1 9/12 75% KEEP Added anti-hallucination rule
2 9/12 75% DISCARD Tried output cap (no effect)
3 11/12 92% KEEP Added worked example
═══════════════════════════════════════════
Best: 11/12 (92%) | Baseline: 8/12 (67%) | Improvement: +25%
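For readers who keep the experiment log outside the conversation, the summary table could be regenerated from a list of entries roughly as follows. The tuple layout and function names are assumptions for illustration; the data is copied from the sample table above, and "Improvement" is computed as the difference between the rounded best and baseline rates.

```python
# Each entry: (experiment number, score, status, change), copied from the sample table.
LOG = [
    (0, 8, "baseline", "(original)"),
    (1, 9, "KEEP", "Added anti-hallucination rule"),
    (2, 9, "DISCARD", "Tried output cap (no effect)"),
    (3, 11, "KEEP", "Added worked example"),
]
MAX_SCORE = 12


def rate(score: int, max_score: int) -> int:
    """Pass rate as a whole-number percentage."""
    return round(100 * score / max_score)


def table_rows(log, max_score):
    """One aligned text row per experiment."""
    rows = []
    for exp, score, status, change in log:
        score_col = f"{score}/{max_score}"
        rate_col = f"{rate(score, max_score)}%"
        rows.append(f"{exp:<5}{score_col:<7}{rate_col:<6}{status:<10}{change}")
    return rows


def summary_line(log, max_score):
    """Best kept score versus baseline; discarded experiments are ignored."""
    baseline = log[0][1]
    best = max(score for _, score, status, _ in log if status != "DISCARD")
    gain = rate(best, max_score) - rate(baseline, max_score)
    return (f"Best: {best}/{max_score} ({rate(best, max_score)}%) | "
            f"Baseline: {baseline}/{max_score} ({rate(baseline, max_score)}%) | "
            f"Improvement: {gain:+d}%")


print("\n".join(table_rows(LOG, MAX_SCORE)))
print(summary_line(LOG, MAX_SCORE))
```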
When the user is satisfied or the score ceiling is reached:
Use create_skill to save the improved version:
create_skill(
name="<user-chosen-name>",
description="<updated description reflecting improvements>",
instructions="<the optimized skill instructions>"
)
Present the top 3–5 changes that helped most (from the experiment log), so the user understands what improved and why.
When the loop completes, present:
create_skill)
Long optimization sessions will accumulate conversation history. Follow these rules:
HANDOFF SUMMARY
═══════════════
Skill: [name]
Current best score: [X]/[max] ([percent]%)
Experiments completed: [N]
Current version: [describe the latest kept mutation]
Top 3 failing evals: [list with what's been tried for each]
Last 3 experiments: [keep/discard and why]
Next to try: [suggestion]
Then tell the user: "Start a new conversation, load the autoresearch skill, and paste this summary to continue."
Setup:
search-content
Baseline (Experiment 0): Applied skill to all 4 inputs × 3 runs. Result: 8/12 (67%). Per-eval: Relevance 3/3, URLs 2/3, Error handling 1/3, Brevity 2/3. Common failures: Edge case "Find files" returns results but no error guidance. Adversarial input produces hallucinated results.
Experiment 1 — KEEP (10/12, 83%): Mutation: Added "When the query is vague or lacks specifics, ask a clarifying question before searching. Do NOT guess what the user means." Per-eval: Relevance 3/3, URLs 2/3, Error handling 3/3 (+2), Brevity 2/3. Delta: +2 points. No regressions. Error handling went from 1/3 to 3/3.
Experiment 2 — DISCARD (10/12, 83%): Mutation: Added "Always validate URLs exist before including in response." Per-eval: Relevance 3/3, URLs 3/3 (+1), Error handling 3/3, Brevity 1/3 (-1 regression). Delta: 0 net (one up, one down). URL validation instruction made responses longer. Discarded.
Experiment 3 — KEEP (11/12, 92%): Mutation: Added anti-pattern: "NEVER fabricate file names, URLs, or metadata. If search returns no results, say so directly — do not invent plausible-sounding results." Per-eval: Relevance 3/3, URLs 3/3 (+1), Error handling 3/3, Brevity 2/3. Delta: +1 point, but this was the URLs eval going from 2→3 consistently. No regressions. Kept.
Final: Baseline 8/12 (67%) → Final 11/12 (92%). 3 experiments, 2 kept. Saved as search-content-v2.
create_skill.