plugins/microsoft-365-agents-toolkit/skills/m365-agent-evaluator/SKILL.md
Use this skill when a user wants to create, run, or analyze evaluation suites for Microsoft 365 Copilot declarative agents with the public @microsoft/m365-copilot-eval CLI. Trigger on intents such as "evaluate my agent", "test my agent", "run my evals", "create eval prompts", "add multi-turn tests", "tune evaluator thresholds", "why is my agent failing", or "set up eval environment variables".
npx skillsauth add microsoft/work-iq m365-agent-evaluator
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
Use this skill to help users evaluate Microsoft 365 Copilot declarative agents with @microsoft/m365-copilot-eval. The skill designs schema-compatible eval datasets, runs the public preview CLI, analyzes results, and recommends targeted fixes.
Default to Microsoft 365 Agents Toolkit (ATK) projects when detected, but do not hard-stop solely because the current directory is not ATK. The CLI can also evaluate deployed agents with an explicit M365_AGENT_ID or --m365-agent-id.
npx -y --package @microsoft/m365-copilot-eval@latest runevals
Do not recommend the old private aka.ms installer, global installs, bare runevals, bare npx runevals, --input, or --html.
- references/workflow.md for the end-to-end operator workflow and CLI commands.
- references/azure-setup.md for prerequisites, env files, and secret handling.
- references/eval-templates.md when creating or editing eval datasets.
- references/pra-framework.md when deciding what scenarios to generate.
- references/result-analysis.md after JSON/CSV/HTML results exist.
- references/guardrails.md before writing files, handling secrets, clearing cache, signing out, or troubleshooting.

1. .env.local, .env.local.user, env\.env.local.user, m365agents.yml, or appPackage\declarativeAgent.json.
2. M365_AGENT_ID, --m365-agent-id, or a named environment file such as env\.env.dev.
3. TENANT_ID, Azure OpenAI in Foundry Models endpoint/key, and recommended/default gpt-4o-mini deployment.
4. evals\evals.json.

Generate schema version 1.2.0 documents with a root items array. Do not generate the old PromptsObject or root prompts format.
Minimum shape:
{
  "schemaVersion": "1.2.0",
  "metadata": {
    "name": "Agent evaluation suite",
    "tags": ["starter"]
  },
  "default_evaluators": {
    "Relevance": {},
    "Coherence": {}
  },
  "items": [
    {
      "prompt": "What can this agent help me with?",
      "expected_response": "The agent explains its supported scope without inventing unsupported capabilities."
    }
  ]
}
Use references\prompts-schema.json as the local schema source and references\eval-templates.md for copyable single-turn, multi-turn, evaluator, and threshold examples.
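A quick pre-flight check can catch the format mistakes described above before the CLI runs. The sketch below is a minimal, hypothetical helper (`check_dataset_shape` is not part of the CLI) that assumes only what this skill states: `schemaVersion` is `1.2.0`, the root holds an `items` array, and a root `prompts` key marks the old format. It is not a substitute for validating against references\prompts-schema.json.

```python
import json


def check_dataset_shape(doc: dict) -> list[str]:
    """Return a list of shape problems; an empty list means the basic shape looks right."""
    problems = []
    if doc.get("schemaVersion") != "1.2.0":
        problems.append("schemaVersion must be '1.2.0'")
    if "prompts" in doc:
        problems.append("root 'prompts' is the old format; use a root 'items' array")
    if not isinstance(doc.get("items"), list):
        problems.append("root 'items' must be an array")
    return problems


# The minimum shape from this skill, parsed from JSON text.
starter = json.loads("""
{
  "schemaVersion": "1.2.0",
  "metadata": {"name": "Agent evaluation suite", "tags": ["starter"]},
  "default_evaluators": {"Relevance": {}, "Coherence": {}},
  "items": [
    {
      "prompt": "What can this agent help me with?",
      "expected_response": "The agent explains its supported scope without inventing unsupported capabilities."
    }
  ]
}
""")

print(check_dataset_shape(starter))  # prints []
```

The helper deliberately stops at the root level, because item-level rules (single-turn versus multi-turn fields) belong to the schema file rather than to a hand-written check.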
Evaluator names are case-sensitive. Use only the public configurable evaluator names unless a newer authoritative source proves otherwise.
| Evaluator | Semantics |
|---|---|
| Relevance | LLM score from 1-5; default threshold 3. |
| Coherence | LLM score from 1-5; default threshold 3. |
| Groundedness | LLM score from 1-5 against context/expected evidence; default threshold 3. |
| Similarity | LLM score from 1-5 against expected_response; default threshold 3. |
| Citations | Count-based citation check; default threshold 1. |
| ExactMatch | Boolean exact string match. |
| PartialMatch | String similarity from 0.0-1.0; default threshold 0.5. |
Treat ToolCallAccuracy as legacy/private for authoring. Do not add it to generated datasets unless current public CLI/schema documentation explicitly reintroduces it.
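Because evaluator names are case-sensitive, a misspelled or legacy name is easy to miss in review. The following sketch, with a hypothetical helper name `unknown_evaluators`, flags any configured name that is not in the public set listed in the table; the name list and thresholds are copied from that table and would need updating if the public schema changes.

```python
# Public evaluator names and default thresholds, copied from the table above.
# ExactMatch is a boolean check, so it has no numeric threshold here.
DEFAULT_THRESHOLDS = {
    "Relevance": 3,
    "Coherence": 3,
    "Groundedness": 3,
    "Similarity": 3,
    "Citations": 1,
    "ExactMatch": None,
    "PartialMatch": 0.5,
}


def unknown_evaluators(configured: dict) -> list[str]:
    """Return configured evaluator names that are not public names (case-sensitive)."""
    return sorted(name for name in configured if name not in DEFAULT_THRESHOLDS)


# "coherence" fails the case-sensitive match; ToolCallAccuracy is legacy/private.
print(unknown_evaluators({"Relevance": {}, "coherence": {}, "ToolCallAccuracy": {}}))
```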
# Version/help checks
npx -y --package @microsoft/m365-copilot-eval@latest runevals --version
npx -y --package @microsoft/m365-copilot-eval@latest runevals --help
# First-time setup / EULA
npx -y --package @microsoft/m365-copilot-eval@latest runevals accept-eula
npx -y --package @microsoft/m365-copilot-eval@latest runevals --init-only
# Batch run with explicit JSON output
npx -y --package @microsoft/m365-copilot-eval@latest runevals --prompts-file evals\evals.json --output .evals\results.json
# Human-review HTML or spreadsheet-friendly CSV
npx -y --package @microsoft/m365-copilot-eval@latest runevals --prompts-file evals\evals.json --output .evals\results.html
npx -y --package @microsoft/m365-copilot-eval@latest runevals --prompts-file evals\evals.json --output .evals\results.csv
# Quick checks
npx -y --package @microsoft/m365-copilot-eval@latest runevals --prompts "What can you help me with?" --expected "The agent describes its supported scope."
# Non-ATK or named environment
npx -y --package @microsoft/m365-copilot-eval@latest runevals --prompts-file evals\evals.json --m365-agent-id <agent-id> --env dev
Use --concurrency only with values 1-5. Start with 1 for debugging and increase only after setup is stable.
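When a command line is assembled programmatically, the 1-5 limit can be enforced before anything runs. This is a sketch only: `build_runevals_argv` is a hypothetical helper, and it assumes `--concurrency` takes its value as a separate argument, which should be confirmed with `runevals --help`.

```python
def build_runevals_argv(prompts_file: str, output: str, concurrency: int = 1) -> list[str]:
    """Build the npx argument list for a batch run, rejecting out-of-range concurrency."""
    if not 1 <= concurrency <= 5:
        raise ValueError("concurrency must be between 1 and 5")
    return [
        "npx", "-y", "--package", "@microsoft/m365-copilot-eval@latest", "runevals",
        "--prompts-file", prompts_file,
        "--output", output,
        # Assumption: the flag and its value are passed as two separate arguments.
        "--concurrency", str(concurrency),
    ]


# Start with 1 while debugging, as recommended above.
print(build_runevals_argv(r"evals\evals.json", r".evals\results.json"))
```

Returning a list rather than a single string keeps paths with spaces intact if the arguments are later handed to a process launcher.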
Before diagnosing agent behavior, confirm which executable is running:
Get-Command runevals -All
npm list -g @microsoft/m365-copilot-eval --depth=0
npm view @microsoft/m365-copilot-eval version
npx -y --package @microsoft/m365-copilot-eval@latest runevals --version
npx -y --package @microsoft/m365-copilot-eval@latest where runevals
If bare runevals prints "This version of the M365 Evals CLI has stopped working and must be updated", treat it as a stale PATH/global install. Re-run with the npx --package ...@latest command above, then ask before removing global shims with npm uninstall -g @microsoft/m365-copilot-eval.
| Path | Purpose |
|---|---|
| .env.local | Non-secret ATK config such as M365_TITLE_ID. |
| .env.local.user or env\.env.local.user | Local secrets such as tenant ID and Azure OpenAI key. |
| env\.env.<environment> | Named environment config for non-ATK or explicit --env workflows. |
| evals\evals.json | Source-controlled eval dataset if the user wants it committed. |
| .evals\ | Local run outputs; usually gitignored. |
Never print or commit secrets, prompts containing sensitive data, retrieved content, debug logs, or raw result files unless the user explicitly asks and confirms the data is safe to share.
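One way to confirm which settings exist without exposing their values is to report variable names only. The sketch below assumes a simple `KEY=value` env-file format; `env_keys_only` and the sample key `EXAMPLE_SECRET` are hypothetical and not part of the CLI or ATK.

```python
def env_keys_only(env_text: str) -> list[str]:
    """Return the variable names from env-file text, never the values."""
    keys = []
    for line in env_text.splitlines():
        stripped = line.strip()
        # Skip blanks, comments, and lines that are not KEY=value pairs.
        if not stripped or stripped.startswith("#") or "=" not in stripped:
            continue
        keys.append(stripped.split("=", 1)[0].strip())
    return keys


sample = "# local secrets\nTENANT_ID=00000000\nEXAMPLE_SECRET = abc=def\n"
print(env_keys_only(sample))  # prints ['TENANT_ID', 'EXAMPLE_SECRET']
```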
Use PRA as a scenario-design framework:
Relevance, Coherence, Similarity, ExactMatch, or PartialMatch; do not use legacy ToolCallAccuracy.
Ask before overwriting an existing dataset. When writing generated evals, write to a temporary file first and rename on success.
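The write-then-rename step can be sketched as follows. `write_dataset_atomically` is a hypothetical helper name; it creates the temporary file in the target directory so the final rename stays on one filesystem, and it leaves the "ask before overwriting" decision to the caller.

```python
import json
import os
import tempfile


def write_dataset_atomically(doc: dict, target_path: str) -> None:
    """Write doc as JSON to a temp file, then rename it over target_path on success."""
    directory = os.path.dirname(os.path.abspath(target_path))
    os.makedirs(directory, exist_ok=True)
    fd, temp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(doc, handle, indent=2)
            handle.write("\n")
        # os.replace overwrites an existing file, so confirm with the user first.
        os.replace(temp_path, target_path)
    except BaseException:
        # Remove the partial temp file so a failed write leaves no debris behind.
        if os.path.exists(temp_path):
            os.remove(temp_path)
        raise
```

If serialization fails partway, the existing dataset is untouched because the rename never happens.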
Analyze only evaluator keys that are present. Missing score keys usually mean the evaluator was not configured for that item, not that it failed.
Use current score keys when present: relevance, coherence, groundedness, similarity, citations, exactMatch, and partialMatch. Group failures into likely root causes: instruction issue, grounding issue, citation issue, expected-answer mismatch, capability gap, auth/environment issue, or eval-quality issue.
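The "analyze only what is present" rule can be illustrated with a small sketch. It rests on several assumptions that are not confirmed by the CLI documentation: that each result item exposes a flat mapping of score key to value, that numeric scores pass when they are greater than or equal to the threshold, and that `exactMatch` is a boolean. The name `failing_evaluators` is hypothetical, and the thresholds are the defaults from the evaluator table.

```python
# Default thresholds from the evaluator table, keyed by result score key.
THRESHOLDS = {
    "relevance": 3,
    "coherence": 3,
    "groundedness": 3,
    "similarity": 3,
    "citations": 1,
    "partialMatch": 0.5,
}


def failing_evaluators(scores: dict) -> list[str]:
    """Return score keys that fall below threshold; keys absent from scores are ignored."""
    failures = []
    for key, value in scores.items():
        if key == "exactMatch":
            # Assumption: exactMatch is reported as a boolean.
            if value is False:
                failures.append(key)
        elif key in THRESHOLDS and value < THRESHOLDS[key]:
            # Assumption: a score passes when it is >= its threshold.
            failures.append(key)
    return failures


# groundedness is absent, so it is treated as "not configured", not as a failure.
print(failing_evaluators({"relevance": 2, "coherence": 4, "exactMatch": False}))
```

The returned keys are a starting point for grouping: for example, a low `groundedness` or `citations` score points toward a grounding or citation issue, while a low `similarity` score alone may indicate an expected-answer mismatch.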
Do not run real tenant-dependent evals unless the user has provided or approved the necessary tenant, agent, and Azure OpenAI configuration.