Skills/improve-skills/SKILL.md
Sets up persistent eval suites for any AI in SharePoint skill and runs self-healing improvement loops on demand. Use this when setting up evals for a skill for the first time, or when the user says "improve [skill-name]", "fix [skill-name]", or "run evals for [skill-name]" to fix a skill that has drifted, started failing, or needs to handle new patterns.
npx skillsauth add zrosenfield/sharepoint-ai-skills improve-skills
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
This skill does two things:
references/evals.md so they're always there when you need them.
Choose the mode based on what the user said and what already exists.
Use SETUP when:
references/evals.md yet
Use IMPROVE when:
references/evals.md
If the mode is unclear, ask: "Do you want to set up evals for the first time, or run the improvement loop against existing evals?"
Guide the user through five questions in sequence. Do not move to the next question until the current one is confirmed.
Ask which skill they want to set up evals for, and confirm the exact skill name.
Load the skill using load_skill so you can read it before continuing. Summarize what the skill does in two or three sentences so the user can confirm you understand it correctly.
Explain that instead of fixed test cases, the skill will pull real items from a SharePoint library or list every time it runs. This keeps tests grounded in current data and lets the skill adapt as the content evolves.
Ask:
If the user wants to provide sample text directly instead of pointing at a library, accept that — note it in the data source field as "user-provided samples" and remind them they will need to paste content at the start of each IMPROVE session.
If the library is empty when IMPROVE runs later, the skill must stop immediately, tell the user the library has no content, and ask them to either add real items or paste sample text to continue.
Explain that you need three to six binary yes-or-no checks that define what a good output looks like for this skill on this kind of content. Before writing evals, read two or three items from the data source so the evals are grounded in what the content actually looks like — not a hypothetical version of it.
Share these rules for writing good evals:
For each eval, capture:
Work collaboratively with the user to write these. Propose drafts based on your reading of the skill, then refine together.
Ask three things:
Present the complete eval suite in the format shown in references/evals-template.md. Read that file now so you use the correct format.
Ask the user to confirm or make corrections. Then save the eval suite by calling create_skill with the target skill name and the file path references/evals.md.
Tell the user: "Evals saved. Anytime you want to improve this skill, just say 'improve [skill-name]' and I'll load these evals and run the loop automatically."
Load two things before proceeding:
load_skill
load_skill with the file path references/evals.md
Parse the eval suite to extract: the data source configuration, the eval criteria, the configuration (items per session, runs per experiment, budget cap), and the improvement history table.
If references/evals.md does not exist, tell the user and switch to SETUP mode.
Pull fresh items from the data source every session — do not reuse items from a previous run.
Read items from the library or list specified in evals.md using the configured sampling method (newest, random, or user-picks). Pull the configured number of items (default: three, max: five).
If the library is empty: Stop immediately. Tell the user the library has no content and ask them to either add items or paste sample text directly. Do not proceed until you have real input to work with.
If the user provides sample text instead of a library: Accept the pasted content as this session's inputs. Note that these are user-provided samples, not live library data.
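The sampling rules in this step can be sketched in Python as follows. This is a minimal illustration, not part of the skill: the `Item` type, the `sample_items` function, and the `EmptyLibraryError` exception are hypothetical names, and the sketch assumes each item carries an ISO 8601 modified timestamp. Only the three sampling methods, the default of three items, the cap of five, and the stop-on-empty rule come from the text.

```python
import random
from dataclasses import dataclass

DEFAULT_ITEMS = 3  # default items per session, per the text
MAX_ITEMS = 5      # upper bound on items per session, per the text


@dataclass
class Item:
    """Hypothetical stand-in for a library or list item."""
    title: str
    modified: str  # assumed ISO 8601 timestamp, so string order matches time order


class EmptyLibraryError(Exception):
    """Raised when there is nothing to sample; the session must stop."""


def sample_items(items, method="newest", count=DEFAULT_ITEMS, picks=None, rng=None):
    """Return this session's items using the configured sampling method."""
    if not items:
        # Stop immediately: the user must add items or paste sample text.
        raise EmptyLibraryError("The library has no content.")
    count = max(1, min(count, MAX_ITEMS))  # clamp to the allowed range
    if method == "newest":
        return sorted(items, key=lambda i: i.modified, reverse=True)[:count]
    if method == "random":
        rng = rng or random.Random()
        return rng.sample(items, min(count, len(items)))
    if method == "user-picks":
        wanted = set(picks or [])  # titles the user chose (assumed to identify items)
        return [i for i in items if i.title in wanted][:count]
    raise ValueError(f"unknown sampling method: {method}")
```

Raising an exception on an empty library, rather than returning an empty list, mirrors the instruction to stop and not proceed without real input.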
Present the sampled items to the user with a brief summary of each (title and a one-sentence description of the content). Ask: "Are these representative of what this skill typically handles? If not, tell me what's missing and I'll adjust." Wait for confirmation before continuing.
Before running a single experiment, look at the sampled items and ask whether the stored evals still reflect what good output looks like for this content.
Check for:
Present your assessment concisely: which evals still look right, which look stale, and what (if anything) seems missing. Ask the user to confirm, update, or leave the evals as-is before proceeding.
If evals are updated, save the revised references/evals.md before running the loop.
Each iteration of the loop follows five phases: Reason, Act, Look, Probe, Harden.
On the first iteration (baseline), do not mutate anything yet. State clearly that you are establishing a baseline with the unmodified skill. Confirm the eval suite with the user and ask them to flag any corrections before you begin scoring.
On subsequent iterations, review the failure patterns identified in the previous Probe phase. Identify the single highest-impact gap and form one mutation hypothesis. Do not try to fix multiple things at once — you cannot know what helped if you change several things simultaneously.
Present your reasoning:
Apply the current skill's instructions to each sampled item and generate an output for each.
On iteration one, apply the unmodified skill exactly as written. On subsequent iterations, apply the version with your mutation applied.
State each item's title or label clearly before generating its output so the results are easy to follow.
Score every output against every eval criterion. Follow this scoring protocol exactly:
First, announce your bias before scoring: "I generated these outputs so I'm naturally biased toward passing them. I will score strictly and require explicit evidence for every pass."
Second, score all evals across all inputs before analyzing patterns. Do not let what you see in the pattern analysis change individual scores retroactively.
Third, require explicit evidence for every pass. You must cite the specific text in the output that satisfies the eval. "The output seems to address it" is not evidence. "The second sentence says X, which directly answers the question" is evidence.
Fourth, default to fail on ambiguity. If you are unsure whether an eval passes, score it as fail.
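One way to picture the third and fourth rules is a record that cannot pass without cited evidence. The sketch below is illustrative only: `Verdict`, `score_eval`, and `subtotal` are hypothetical names, and representing an unsure judgment as `None` is an assumption made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Verdict:
    eval_name: str
    passed: bool
    evidence: str  # the specific output text cited for a pass
    reason: str    # one-sentence justification


def score_eval(eval_name: str, judged_pass: Optional[bool],
               evidence: str, reason: str) -> Verdict:
    """Record a verdict; judged_pass is True, False, or None when unsure."""
    has_evidence = bool(evidence.strip())
    # A pass needs both a clear "yes" and quoted evidence.
    passed = judged_pass is True and has_evidence
    if judged_pass is True and not has_evidence:
        reason = "No quoted evidence, so scored as fail. " + reason
    elif judged_pass is None:
        reason = "Ambiguous, so scored as fail. " + reason
    return Verdict(eval_name, passed, evidence, reason)


def subtotal(verdicts) -> int:
    """Count the passes for one input."""
    return sum(v.passed for v in verdicts)
```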
Present each scored output in this structure: the input label, then each eval with its verdict, the evidence you are citing, and a one-sentence reason. End each input's scores with its subtotal.
After all inputs are scored, present the aggregate total score and the percentage of the maximum possible score.
The maximum possible score equals the number of evals multiplied by the number of runs per experiment multiplied by the number of sampled items this session.
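As a worked example of this formula, the sketch below computes the maximum and the percentage. The function names and the sample numbers (four evals, three runs, three items, 27 passes) are hypothetical and chosen only to illustrate the arithmetic.

```python
def max_score(num_evals: int, runs_per_experiment: int, num_items: int) -> int:
    """Maximum possible score: evals x runs per experiment x sampled items."""
    return num_evals * runs_per_experiment * num_items


def percentage(total: int, maximum: int) -> float:
    """Aggregate score as a percentage of the maximum."""
    return 100.0 * total / maximum


maximum = max_score(num_evals=4, runs_per_experiment=3, num_items=3)
print(maximum)                  # 36
print(percentage(27, maximum))  # 75.0
```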
Look across all scored outputs and identify failure patterns:
Identify one mutation to try next. Be specific about what part of the skill you would change and why you believe it will help.
If you have had three consecutive experiments targeting the same eval with no improvement, stop and try a fundamentally different mutation type. If adding instructions has not worked, try adding an example. If examples have not worked, try removing or simplifying something instead.
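The stall rule can be sketched as a check over the experiment history. This is an assumption-laden illustration: the history format (a list of `(target_eval, improved)` pairs), the mutation type labels, and the function names are all hypothetical, and wrapping back to the first type after simplifying is a choice made here, not something the text specifies.

```python
# Escalation order taken from the text: instructions, then examples, then simplifying.
ESCALATION = ["add_instruction", "add_example", "simplify"]


def is_stalled(history, target_eval, window=3):
    """True if the last `window` experiments all targeted this eval without improvement."""
    recent = history[-window:]
    return len(recent) == window and all(
        name == target_eval and not improved for name, improved in recent
    )


def next_mutation_type(current_type, stalled):
    """Stay on the current type unless stalled; otherwise move to the next one."""
    if not stalled:
        return current_type
    idx = ESCALATION.index(current_type)
    # Wrapping around after "simplify" is an assumption for this sketch.
    return ESCALATION[(idx + 1) % len(ESCALATION)]
```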
Decide whether to keep or discard this experiment's mutation.
Minimum threshold to keep: The score must improve by at least two points, or by at least ten percent relative to the previous best, whichever is smaller. A one-point gain across three runs is likely noise.
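Read literally, the threshold is the smaller of two points and ten percent of the previous best. The sketch below implements that reading; the function names are hypothetical, and requiring a strictly positive gain (so a previous best of zero cannot be "improved" by an unchanged score) is an added assumption.

```python
def required_gain(previous_best: float) -> float:
    """Smaller of two points and ten percent of the previous best score."""
    return min(2.0, 0.10 * previous_best)


def meets_keep_threshold(new_score: float, previous_best: float) -> bool:
    """True if the gain over the previous best clears the minimum threshold."""
    gain = new_score - previous_best
    # The strictly positive check is an assumption for the zero-baseline case.
    return gain > 0 and gain >= required_gain(previous_best)
```

Under this reading, with a previous best of 27 the required gain is two points, so a score of 28 is discarded and a score of 30 clears the bar. With a previous best of 10, ten percent is one point, so a one-point gain would clear it.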
Regression check: Before keeping, check whether any individual eval dropped compared to the previous kept version. If any eval dropped by two or more points, flag it as a regression — even if the total score went up.
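The regression check compares per-eval scores rather than totals. A minimal sketch, assuming per-eval scores are kept as dictionaries keyed by eval name (the function name, the data layout, and the sample numbers are hypothetical):

```python
def find_regressions(previous_kept: dict, candidate: dict, min_drop: int = 2) -> dict:
    """Return evals whose score fell by at least `min_drop`, mapped to the size of the drop."""
    regressions = {}
    for name, old_score in previous_kept.items():
        drop = old_score - candidate.get(name, 0)  # a missing eval counts as zero
        if drop >= min_drop:
            regressions[name] = drop
    return regressions


previous_kept = {"answers question": 9, "cites source": 7, "right length": 5}
candidate = {"answers question": 9, "cites source": 5, "right length": 8}

# The total rose from 21 to 22, yet one eval dropped by two points.
print(find_regressions(previous_kept, candidate))  # {'cites source': 2}
```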
Apply these rules:
If keeping: update your working version of the skill with the mutation applied. If discarding: revert to the version before this mutation.
Present a compact experiment log entry and an updated running progress table after every experiment. See references/evals-template.md for the format.
Stop the improvement loop when any of these conditions are met:
When the loop terminates:
create_skill with the target skill name. This replaces the original; SharePoint versioning preserves the history so nothing is lost.
references/evals.md and save it back using create_skill
references/evals.md at the end of every session.