plugins/sagemaker-ai/skills/model-evaluation/SKILL.md
Generates Python code that evaluates SageMaker models. Supports two evaluation types: LLM-as-Judge and Custom Scorer. Use when the user says "evaluate my model", "test model performance", "how did my model perform", "compare models", or makes a similar request.
Generate code that evaluates a SageMaker model.
When generating a notebook, write the .ipynb file with the complete notebook JSON, or use notebook MCP tools (e.g., create_notebook, add_cell) if available. Do NOT use bash commands, shell scripts, or echo/cat piping to generate notebooks.

This skill supports the evaluation feature for SageMaker Serverless Model Customization, so it can help evaluate any SageMaker JumpStart model that serverless model customization supports. Tell this to the user when the skill is activated:
"This skill can help us evaluate any base or fine-tuned model that is supported by SageMaker serverless model customization"
If the user requests help evaluating a different type of model, explain to them that this is not supported by the skill.
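As a rough illustration of what "complete notebook JSON" means, the sketch below builds a minimal nbformat 4 document with one markdown cell and one code cell. The helper names `build_notebook` and `write_notebook` are hypothetical and not part of the skill; the skill itself expects the agent to write the file directly or through notebook MCP tools, so this only shows the shape of the JSON.

```python
import json
from pathlib import Path


def build_notebook(cells):
    """Return a minimal nbformat 4 notebook dict.

    `cells` is a list of (cell_type, source) pairs, where cell_type is
    "markdown" or "code". Hypothetical helper, for illustration only.
    """
    nb_cells = []
    for cell_type, source in cells:
        cell = {
            "cell_type": cell_type,
            "metadata": {},
            # nbformat stores source as a list of lines that keep their newlines
            "source": source.splitlines(keepends=True),
        }
        if cell_type == "code":
            # Code cells additionally carry an execution count and outputs
            cell["execution_count"] = None
            cell["outputs"] = []
        nb_cells.append(cell)
    return {
        "cells": nb_cells,
        "metadata": {},
        "nbformat": 4,
        # Minor version 4 is used here so per-cell "id" fields are not required
        "nbformat_minor": 4,
    }


def write_notebook(path, cells):
    """Serialize the notebook dict to a .ipynb file."""
    notebook = build_notebook(cells)
    Path(path).write_text(json.dumps(notebook, indent=1), encoding="utf-8")
```

Calling `write_notebook("evaluation.ipynb", [("markdown", "# Model evaluation"), ("code", "print('placeholder')")])` would produce a file that Jupyter can open; the file name and cell contents are placeholders.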
There are two evaluation types that can be used to evaluate a model: LLM-as-Judge and Custom Scorer.
Do you already know which evaluation type to use?
Check conversation history, plan.md, workflow_state.json, or anything else you've already read.
If yes: confirm with the user.
"It sounds like you want to run [evaluation type]. Is that right?"
⏸ Wait for confirmation. If confirmed → go to Step 2.
If no: ask.
"What kind of evaluation would you like to run? I support:
- LLM-as-Judge — an LLM grades your model's responses
- Custom Scorer — programmatic scoring (math, code, or your own logic)
Pick one, or say 'help me decide' if you're not sure."
⏸ Wait for user.
If the user says 'help me decide': read references/evaluation-type-guide.md and follow its instructions. It will guide the user to a choice and then return here.
You MUST NEVER make a recommendation to the user on eval type without reading references/evaluation-type-guide.md.

Before reading the reference file, validate that the chosen evaluation type is compatible with the user's situation. You may already know these answers from conversation context — don't ask if you don't need to.
- Run list-tags on the training job ARN and look for the sagemaker-studio:jumpstart-model-id tag. If the value contains "nova", the model is Nova; anything else is OSS.
- Run describe-model-package and check the model description or source tags.

If validation fails, tell the user which requirement(s) aren't met and offer alternatives:
"[Evaluation type] won't work because [reason]."
If the failure reason was lack of an eval dataset, there's nothing we can do. Inform the user:
"Unfortunately all of the supported eval types require an eval dataset. I can't help you with model evaluation."
If the failure reason is something else, offer to help them pick a different evaluation type.
⏸ Wait for user.
If they say they do want help choosing a different eval type → read references/evaluation-type-guide.md.
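The tag check used during validation (look for the sagemaker-studio:jumpstart-model-id tag; "nova" in the value means Nova, anything else means OSS) could be sketched as follows. This is a minimal sketch, not the skill's own implementation: the function names are hypothetical, the case-insensitive match is an assumption, and returning `None` when the tag is missing is an assumption about how to signal that another lookup is needed. The boto3 call shown is the SageMaker client's `list_tags`.

```python
from typing import Optional

JUMPSTART_TAG = "sagemaker-studio:jumpstart-model-id"


def model_family_from_tags(tags: list[dict]) -> Optional[str]:
    """Classify a model as "Nova" or "OSS" from SageMaker resource tags.

    Returns None when the JumpStart tag is absent (assumption: the caller
    then falls back to another lookup, such as describe-model-package).
    """
    for tag in tags:
        if tag.get("Key") == JUMPSTART_TAG:
            # Assumption: match "nova" case-insensitively
            value = tag.get("Value", "").lower()
            return "Nova" if "nova" in value else "OSS"
    return None


def fetch_tags(resource_arn: str) -> list[dict]:
    """Collect all tags on a SageMaker resource, following pagination."""
    import boto3  # imported lazily so the classifier works without AWS access

    client = boto3.client("sagemaker")
    tags: list[dict] = []
    token = None
    while True:
        kwargs = {"ResourceArn": resource_arn}
        if token:
            kwargs["NextToken"] = token
        response = client.list_tags(**kwargs)
        tags.extend(response.get("Tags", []))
        token = response.get("NextToken")
        if not token:
            return tags
```

With AWS credentials configured, `model_family_from_tags(fetch_tags(training_job_arn))` would return the family for a given training job ARN, where `training_job_arn` is a value you supply.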
If validation passes, read the corresponding reference file:
| User chose | Read |
| ------------- | ---------------------------------------- |
| LLM-as-Judge | references/llmaaj-evaluation.md |
| Custom Scorer | references/custom-scorer-evaluation.md |
Follow the reference file's instructions from the beginning.
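The routing in the table above amounts to a small lookup. The sketch below is one way to express it; the function name `reference_for` is hypothetical, and falling back to the evaluation type guide for any unrecognized input (including "help me decide") is an assumption rather than something the skill specifies.

```python
REFERENCE_FILES = {
    "LLM-as-Judge": "references/llmaaj-evaluation.md",
    "Custom Scorer": "references/custom-scorer-evaluation.md",
}
TYPE_GUIDE = "references/evaluation-type-guide.md"


def reference_for(choice: str) -> str:
    """Return the reference file to read for the user's chosen evaluation type."""
    normalized = choice.strip().lower()
    for name, path in REFERENCE_FILES.items():
        if name.lower() == normalized:
            return path
    # Assumption: anything else is treated as a request for help choosing
    return TYPE_GUIDE
```

Keeping the mapping in one dictionary means a new evaluation type would only need one added entry, which mirrors how the table is laid out.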