skills/create-eval/SKILL.md
Create plugin development eval scenarios (JSON files with natural prompts and deterministic checks for testing plugin skills). NOT for Copilot Studio in-product evaluation — use /copilot-studio:create-eval-set for that.
Guide the user through creating eval test cases for a Copilot Studio plugin scenario. Evals test end-to-end scenarios with natural prompts — the request routes through sub-agents (e.g., Author agent) which invoke skills internally.
The eval harness (evals/evaluate.py) works by:
- claude -p "<prompt>" with a PreToolUse hook that traces skill invocations inside sub-agents

Authoring scenarios that produce YAML files (topics, agents, knowledge sources, etc.) are the best candidates. The harness supports these check types:
| Check | What it validates | Use for |
|-------|------------------|---------|
| agent_invoked | Expected sub-agent was dispatched (e.g., Author agent) | Routing verification |
| agent_not_invoked | Unwanted sub-agents were NOT dispatched | Routing verification |
| skill_invoked | Expected skill was invoked (traced inside sub-agents via hook) | Skill routing |
| skill_not_invoked | Unwanted skills were NOT invoked | Skill routing |
| files_created | Expected files were created/modified (glob pattern) | All authoring scenarios |
| schema_validate | Full Copilot Studio schema validation (kind, required fields, IDs, Power Fx, scopes) | All YAML-producing scenarios |
| yaml_structure | Specific YAML path has expected value, min array length, or contains string | Structural assertions |
| content_contains | Keywords from prompt appear in output files | Domain relevance |
| no_placeholders | No _REPLACE, TODO, or FIXME markers left | Template completion |
| stdout_contains | CLI response text contains expected strings | Reference/info scenarios |
| stdout_not_contains | CLI response does NOT contain error strings | Error absence |
| exit_code | CLI exited with expected code | All scenarios |
| yaml_unchanged | Specific file or YAML path was NOT modified | Preservation testing |
Note: no_placeholders runs automatically when any .mcs.yml file is changed, unless explicitly set to false.
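As an illustration of the no_placeholders row and the note above, here is a minimal sketch of how such a check could work. It is not the harness's actual code: the function names, the regular expression, and the treatment of an explicit `true` are all assumptions made for this example.

```python
import re

# Markers named in the check table; the exact matching rules are an assumption.
PLACEHOLDER_RE = re.compile(r"_REPLACE|\bTODO\b|\bFIXME\b")


def find_placeholders(text: str) -> list[str]:
    """Return every placeholder marker found in the text, in order of appearance."""
    return PLACEHOLDER_RE.findall(text)


def should_check_placeholders(checks: dict, changed_files: list[str]) -> bool:
    """Decide whether the placeholder check runs for one eval (hypothetical helper)."""
    setting = checks.get("no_placeholders")
    if setting is False:
        # Explicit opt-out, as described in the note.
        return False
    if setting is True:
        # Assumption: an explicit true always runs the check.
        return True
    # Default: run whenever any .mcs.yml file was changed.
    return any(name.endswith(".mcs.yml") for name in changed_files)
```

With this sketch, an eval that omits no_placeholders but modifies a topic file would still be scanned, while one that sets it to false would be skipped.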
Not yet testable: Integration scenarios that call external APIs (chat-directline, manage-agent) — these need script mocking which isn't implemented yet.
Fixtures are pre-built agent directories in evals/fixtures/:
- GenerativeActionsEnabled: false, one Greeting topic. Use for most authoring evals.

If the scenario needs a richer agent (e.g., existing topics to modify, knowledge sources, actions), note that the fixture would need to be created first.
Identify the target scenario. If $ARGUMENTS is provided, use it as the scenario name. Otherwise ask the user what scenario they want to test (e.g., "topic creation", "agent settings", "knowledge sources").
Read relevant skill SKILL.md files to understand what the scenario covers:
Glob: skills/*/SKILL.md
Understand: What skills are involved? What YAML kinds? What files get created/modified?
Check if evals already exist:
Glob: evals/scenarios/<scenario-name>.json
If yes, read them and offer to add more test cases. Note the highest existing eval ID.
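Finding the highest existing eval ID can be sketched as below. This assumes the scenario file follows the format shown later in this document (a top-level "evals" array whose entries carry an integer "id"); the helper name `next_eval_id` is hypothetical and not part of the harness.

```python
import json
from pathlib import Path


def next_eval_id(scenario_path: Path) -> int:
    """Return one more than the highest eval ID in the file, or 1 if there is none."""
    if not scenario_path.exists():
        return 1
    data = json.loads(scenario_path.read_text(encoding="utf-8"))
    ids = [entry["id"] for entry in data.get("evals", [])]
    # default=0 covers a file with an empty "evals" array.
    return max(ids, default=0) + 1
```

New test cases appended to an existing scenario would then start numbering from the returned value.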
Guide the user through creating test cases. For each eval, gather:
basic-agent)
Help the user define checks. Based on the scenario type:
For topic-creation scenarios:
{
  "agent_invoked": "copilot-studio:Copilot Studio Author",
  "skill_invoked": "copilot-studio:new-topic",
  "files_created": [{"pattern": "topics/*.topic.mcs.yml", "min_count": 1}],
  "schema_validate": true,
  "yaml_structure": [
    {"path": "kind", "equals": "AdaptiveDialog"},
    {"path": "beginDialog.kind", "equals": "<trigger-type>"}
  ],
  "content_contains": ["<domain keywords>"],
  "no_placeholders": true
}
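To make the yaml_structure entries concrete, the sketch below shows one way a dotted path such as beginDialog.kind could be resolved against an already-parsed YAML document. Only the "path" and "equals" keys come from this document; "min_length" and "contains" are hypothetical key names standing in for the "min array length" and "contains string" options, and the harness may implement all of this differently.

```python
from typing import Any


def resolve_path(doc: Any, path: str) -> Any:
    """Walk a dotted path like 'beginDialog.kind' through nested dicts and lists."""
    node = doc
    for part in path.split("."):
        if isinstance(node, list):
            node = node[int(part)]  # numeric segments index into arrays
        else:
            node = node[part]
    return node


def check_structure(doc: dict, assertion: dict) -> bool:
    """Evaluate one yaml_structure-style assertion against a parsed document."""
    try:
        value = resolve_path(doc, assertion["path"])
    except (KeyError, IndexError, ValueError, TypeError):
        return False  # the path does not exist in the document
    if "equals" in assertion and value != assertion["equals"]:
        return False
    if "min_length" in assertion:  # hypothetical key name
        if not isinstance(value, list) or len(value) < assertion["min_length"]:
            return False
    if "contains" in assertion and assertion["contains"] not in str(value):  # hypothetical key name
        return False
    return True


# Illustrative document; the trigger kind is only an example value.
topic = {
    "kind": "AdaptiveDialog",
    "beginDialog": {"kind": "OnRecognizedIntent", "actions": [{"kind": "SendActivity"}]},
}
print(check_structure(topic, {"path": "beginDialog.kind", "equals": "OnRecognizedIntent"}))
```

A missing path is treated as a failed assertion rather than an error, so one bad entry does not abort the remaining checks.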
For agent-settings scenarios:
{
  "agent_invoked": "copilot-studio:Copilot Studio Author",
  "skill_invoked": "copilot-studio:edit-agent",
  "files_created": [{"pattern": "agent.mcs.yml", "min_count": 1}],
  "schema_validate": true,
  "yaml_structure": [
    {"path": "kind", "equals": "GptComponentMetadata"}
  ],
  "content_contains": ["<expected content>"],
  "no_placeholders": true
}
For knowledge-source scenarios:
{
  "agent_invoked": "copilot-studio:Copilot Studio Author",
  "skill_invoked": "copilot-studio:add-knowledge",
  "files_created": [{"pattern": "knowledge/*.knowledge.mcs.yml", "min_count": 1}],
  "schema_validate": true,
  "no_placeholders": true
}
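The files_created entries pair a glob pattern with a min_count. A rough sketch of how such an expectation might be evaluated follows, assuming the harness has a list of changed paths relative to the agent directory, written with forward slashes. The function name and the default of 1 for a missing min_count are assumptions, and `fnmatchcase` lets `*` match across `/`, which a real implementation might restrict.

```python
from fnmatch import fnmatchcase


def files_created_ok(changed: list[str], expectations: list[dict]) -> bool:
    """Return True when every pattern is matched by at least min_count changed files."""
    for expectation in expectations:
        matches = [path for path in changed if fnmatchcase(path, expectation["pattern"])]
        # Assumption: a missing min_count means "at least one file".
        if len(matches) < expectation.get("min_count", 1):
            return False
    return True


changed_files = ["knowledge/faq.knowledge.mcs.yml"]
expected = [{"pattern": "knowledge/*.knowledge.mcs.yml", "min_count": 1}]
print(files_created_ok(changed_files, expected))
```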
For reference/query scenarios:
{
  "stdout_contains": ["<expected content in response>"],
  "exit_code": 0
}
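For reference/query scenarios the checks only look at the CLI response and exit code. The sketch below shows how those three checks could be applied to a captured result; the function name, the failure-message format, and case-sensitive substring matching are assumptions for illustration, not the harness's documented behavior.

```python
def check_cli_result(stdout: str, exit_code: int, checks: dict) -> list[str]:
    """Return a list of failure messages; an empty list means every check passed."""
    failures = []
    for needle in checks.get("stdout_contains", []):
        if needle not in stdout:  # assumption: plain, case-sensitive substring match
            failures.append(f"stdout_contains: missing {needle!r}")
    for needle in checks.get("stdout_not_contains", []):
        if needle in stdout:
            failures.append(f"stdout_not_contains: found {needle!r}")
    if "exit_code" in checks and exit_code != checks["exit_code"]:
        failures.append(f"exit_code: expected {checks['exit_code']}, got {exit_code}")
    return failures
```

Collecting all failures instead of stopping at the first one makes a single run more informative when several expectations are wrong at once.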
Recommend at least 3 test cases that cover different possibilities within the scenario. For example, for topic-creation:
Write the scenario JSON file:
Write: evals/scenarios/<scenario-name>.json
Format:
{
  "scenario_name": "<scenario-name>",
  "evals": [
    {
      "id": 1,
      "name": "<short descriptive title>",
      "prompt": "<natural language request — what a user would say>",
      "fixture": "basic-agent",
      "mock_scripts": [],
      "checks": { ... }
    }
  ]
}
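Before handing a new scenario file to the harness, a quick structural sanity check can catch missing fields or duplicate IDs. The sketch below is one possible pre-flight check based only on the format shown above. Whether the harness itself requires each of these keys is an assumption, and mock_scripts is treated as optional here.

```python
# Keys taken from the format above; treating them as mandatory is an assumption.
REQUIRED_KEYS = {"id", "name", "prompt", "fixture", "checks"}


def validate_scenario(scenario: dict) -> list[str]:
    """Return a list of problems found in a scenario dict (empty when it looks fine)."""
    problems = []
    if not scenario.get("scenario_name"):
        problems.append("missing scenario_name")
    seen_ids = set()
    for index, entry in enumerate(scenario.get("evals", [])):
        missing = REQUIRED_KEYS - entry.keys()
        if missing:
            problems.append(f"eval #{index}: missing {sorted(missing)}")
        eval_id = entry.get("id")
        if eval_id in seen_ids:
            problems.append(f"eval #{index}: duplicate id {eval_id}")
        seen_ids.add(eval_id)
    return problems
```

Running this on the parsed JSON before invoking evaluate.py keeps simple authoring mistakes from showing up as confusing eval failures.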
Tell the user how to run the evals:
python3 evals/evaluate.py --scenario <scenario-name> --verbose
Or for all scenarios: node evals/run.js
To generate the HTML report: python3 evals/report.py evals/results/<timestamp>/
- Include agent_invoked and skill_invoked checks to verify correct routing
- Set schema_validate: true for ALL scenarios that produce YAML; it is the most powerful check
- content_contains keywords should come directly from the prompt to verify domain relevance