skills/create-eval-set/SKILL.md
Create a test set CSV file for import into Copilot Studio's in-product Evaluate tab. Reads the agent's topics, instructions, and knowledge sources to generate meaningful test cases with appropriate graders (General quality, Compare meaning, Exact match, etc.). Use when the user asks to create, prepare, or generate evaluation test cases for their agent.
Install this skill globally with one command (works with Claude Code, Cursor, and Windsurf):

npx skillsauth add microsoft/skills-for-copilot-studio skills/create-eval-set
Create a test set CSV file that can be imported into Copilot Studio's Evaluate tab for in-product agent evaluation.
Read the agent's YAML files to understand what it does:
- Glob `**/agent.mcs.yml`: find the agent
- `agent.mcs.yml`: get the agent's instructions, description, and capabilities
- `settings.mcs.yml`: check orchestration mode (generative vs classic)
- Glob `**/topics/*.mcs.yml`: list all topics

Create test cases that cover:
| Category | What to test | Example |
|----------|-------------|---------|
| Core functionality | Main topics and capabilities | Questions matching trigger phrases |
| Knowledge/generative | Knowledge source responses | Questions the agent should answer from its knowledge |
| System topics | Greeting, Escalation, Goodbye, Thank You, Fallback | "Hi", "I want to speak to a person", "Goodbye" |
| Edge cases | Out-of-scope, ambiguous, off-topic | "Tell me a joke", "Book a flight for me" |
| Boundary testing | Things the agent should NOT do | Actions beyond its capabilities |
Aim for 10–25 test cases with good coverage across categories.
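The file discovery step described earlier in this section (globbing for `agent.mcs.yml` and the topic files) could be sketched roughly as follows. This is a minimal illustration, not part of the skill itself: the function name `discover_agent_files` is hypothetical, and it assumes `settings.mcs.yml` sits in the same directory as `agent.mcs.yml`, which the source does not state.

```python
from pathlib import Path


def discover_agent_files(root: Path) -> dict:
    """Locate the agent definition, its settings file, and its topic files.

    Assumption: settings.mcs.yml lives next to agent.mcs.yml.
    """
    agent_files = sorted(root.glob("**/agent.mcs.yml"))
    if not agent_files:
        raise FileNotFoundError(f"no agent.mcs.yml found under {root}")

    agent_file = agent_files[0]  # use the first match if several agents exist
    settings_file = agent_file.parent / "settings.mcs.yml"

    return {
        "agent": agent_file,
        "settings": settings_file if settings_file.exists() else None,
        "topics": sorted(root.glob("**/topics/*.mcs.yml")),
    }
```

The topic file names returned under `"topics"` give a quick checklist for the "Core functionality" and "System topics" rows of the coverage table.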
The CSV import only supports two columns: question and expectedResponse. Test methods cannot be set via CSV import — they are configured in the UI after import. The default test method (General quality) is applied to all imported test cases.
Write expected responses with this in mind:
- Leave expectedResponse empty for questions that only need General quality (it works without expected responses)

| Test method | What it measures | Requires expected response? |
|-------------|-----------------|---------------------------|
| General quality (default) | AI-graded quality: relevance, completeness, groundedness, abstention | No (but recommended as a rubric) |
| Compare meaning | Semantic similarity: compares meaning and intent | Yes |
| Text similarity | Cosine similarity of text | Yes, configurable pass threshold |
| Exact match | Character-for-character match | Yes |
| Keyword match | Response contains expected keywords/phrases | Yes (keywords added in UI) |
| Capability use | Agent called expected tools/topics | Configured in UI |
| Custom | Custom grader with your own instructions and labels | Configured in UI |
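When drafting test cases, it can help to note which test method you intend to select in the UI later and check that the rows needing an expected response actually have one. The sketch below is only a planning aid: the `missing_expected` helper and the "planned method" note are hypothetical, the planned method is never written to the CSV, and only the methods whose table row says a plain "Yes" are treated as requiring a response.

```python
# Methods whose row in the table above says an expected response is required.
# Keyword match, Capability use, and Custom are configured in the UI, so they
# are deliberately left out of this check.
NEEDS_EXPECTED = {"Compare meaning", "Text similarity", "Exact match"}


def missing_expected(cases):
    """Return questions whose planned method needs an expected response but lacks one.

    `cases` is an iterable of (question, expected_response, planned_method) tuples.
    """
    return [
        question
        for question, expected, method in cases
        if method in NEEDS_EXPECTED and not (expected or "").strip()
    ]
```

For example, a draft containing `("What is 2+2?", "", "Exact match")` would be reported, while `("Tell me a joke", "", "General quality")` would not.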
Write the CSV file using the Write tool. The format must be:
"question","expectedResponse"
"User question here","Expected agent response or behavioral rubric"
"Question without expected response",
| Column | Required | Description |
|--------|----------|-------------|
| question | Yes | The user message to send to the agent. Max 1,000 characters. |
| expectedResponse | No | The expected response or behavioral rubric. Leave empty if not needed. |
Important: The Testing method column is not supported on import — it is ignored. All imported test cases get the default test method (General quality). Configure other test methods in the UI after import.
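As an illustration of the two-column format, the following sketch writes a test set with Python's standard `csv` module. The function name `write_test_set`, the UTF-8 encoding, and the choice to quote every field are assumptions made for this example; the skill itself only specifies the column names and the 1,000-character limit on questions.

```python
import csv

MAX_QUESTION_CHARS = 1000  # limit stated in the column table above


def write_test_set(path, cases):
    """Write (question, expected_response) pairs as a two-column CSV file."""
    rows = []
    for question, expected in cases:
        # Validate everything first so a bad row never leaves a partial file.
        if not question.strip():
            raise ValueError("every test case needs a question")
        if len(question) > MAX_QUESTION_CHARS:
            raise ValueError(f"question exceeds {MAX_QUESTION_CHARS} characters")
        rows.append([question, expected or ""])

    # newline="" lets the csv module control line endings itself.
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle, quoting=csv.QUOTE_ALL)
        writer.writerow(["question", "expectedResponse"])
        writer.writerows(rows)
```

With `QUOTE_ALL`, a missing expected response is written as `""` rather than left blank after the comma. Both forms parse as an empty field, though it is an assumption here that the import treats them identically.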
Save the file in .csv format.

Behavioral rubric (for General quality):
"Find me a hotel in Paris","The response should include hotel recommendations in Paris with relevant details like names, locations, or prices."
Realistic reply (for Compare meaning — set method in UI after import):
"Hi there","Hello! How can I help you today?"
Exact expected text (for Exact match — set method in UI after import):
"What is 2+2?","4"
After writing the CSV, tell the user:
To import into Copilot Studio:
- Open your agent in Copilot Studio
- Go to the Evaluate tab
- Click New evaluation > Single response
- Drag or browse for the CSV file
- Review the imported test cases and adjust if needed
- Optionally add more test methods (Capability use, Custom) in the UI
- Click Evaluate to run, or Save to run later
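Before following the import steps above, it may be worth checking the file locally. The sketch below parses CSV text and reports rows that break the documented format (header names, two columns, required question, 1,000-character limit). The `validate_test_set` name and the wording of the messages are hypothetical, and the checks mirror only what this document states, not the full set of rules Copilot Studio may apply.

```python
import csv
import io

EXPECTED_HEADER = ["question", "expectedResponse"]
MAX_QUESTION_CHARS = 1000


def validate_test_set(csv_text):
    """Return a list of human-readable problems; an empty list means none found."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows or rows[0] != EXPECTED_HEADER:
        return ["header row must be: question,expectedResponse"]

    problems = []
    for number, row in enumerate(rows[1:], start=2):
        if not row:
            continue  # skip blank lines
        if len(row) > 2:
            problems.append(f"row {number}: more than two columns")
        if not row[0].strip():
            problems.append(f"row {number}: question is empty")
        elif len(row[0]) > MAX_QUESTION_CHARS:
            problems.append(f"row {number}: question exceeds {MAX_QUESTION_CHARS} characters")
    return problems
```

Row numbers count CSV records, with the header as row 1, so they can differ from physical line numbers when a quoted field contains a line break.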
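After import, some things can only be configured in the UI, such as switching test cases to a test method other than General quality, adding keywords for Keyword match, and setting up Capability use or Custom graders.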