plugins/sagemaker-ai/skills/dataset-transformation/SKILL.md
Generates code that transforms datasets between ML schemas for model training or evaluation. Use when the user says "transform", "convert", "reformat", "change the format", or when a dataset's schema needs to change to match the target format — always use this skill for format changes rather than writing inline transformation code. Supports OpenAI chat, SageMaker SFT/DPO/RLVR/RLAIF, HuggingFace preference, Bedrock Nova, VERL, and custom JSONL formats from local files or S3.
npx skillsauth add awslabs/agent-plugins dataset-transformationInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Transforms a data set provided by the user into their desired format.
sdk-getting-started skill first..jsonl (JSON Lines — one JSON object per line).This skill supports two transformation purposes — training data and evaluation data — each with its own format resolution path. The purpose is determined in Step 1 of the workflow.
Resolve the target format using the reference file ../dataset-evaluation/references/strategy_data_requirements.md. When the transformation is for model training, the required format depends on both the model type (Open Weights like Llama/Qwen vs Nova) and the finetuning technique (SFT, DPO, RLVR, RLAIF) — make sure to match on both dimensions. If either the model type or technique is not yet known, ask the user before resolving the format.
When the transformation is for model evaluation, resolve the target format using this order:
references/sagemaker_dataset_formats.md. Inform the user that the format schemas are from an offline copy and may be outdated.Use whichever source you successfully access as the source of truth for the target format. Do not rely on memorized schemas.
Your first response should determine whether this transformation is for model training or model evaluation. If the context already makes this clear (e.g., the user said "I need to prep my training data" or "I need to format my eval dataset"), confirm your understanding and move on. Otherwise, ask:
"Is this dataset transformation for model training or model evaluation? This helps me look up the right target format for you."
Remember this choice — it determines how the target format is resolved in Step 3.
⏸ Wait for user.
Acknowledge the user's request and state what this skill can do:
"I can help you transform your dataset's format! Here's my plan: I will first need to understand the format of your dataset and the transformation requirements. Once I have that, I will generate a dataset transformation function that we can refine together. After the dataset transformation function is refined to your liking, I will perform the transformation task and upload it to your desired location! Does this sound good?"
⏸ Wait for user.
For this step, you need to know: what dataset format the user would like to transform their dataset from and what dataset format they would like to transform it in to. If you know this already, skip this step. If not, ask the user:
"What's the dataset format you would like to transform it into?"
Resolve the target format based on the purpose determined in Step 1:
"I've found a SageMaker dataset format: {sagemaker-dataset-format-name} with schema: {sagemaker-dataset-format-schema}. Is this what you were referring to?"
If the user describes a custom format not listed in the reference doc, ask them to provide a sample record of the desired output format.
⏸ Wait for user.
For this step, you need: the location of the user's dataset. If you know this already, skip this step. If not, ask the user:
"Where can I find your dataset? Either a local directory or S3 location works!"
⏸ Wait for user.
Read 1–2 sample records from the user's dataset and show them so the user can confirm the source schema. Do not run format detection — that is handled by the planning skill before this skill is invoked.
Do not show a side-by-side mapping to the target format here — the detailed mapping will be handled in Step 7 when generating the transformation function.
⏸ Wait for user.
For this step, you need: to understand where to output the transformed dataset to. It could be an S3 URI or local directory If you already know where the dataset is supposed to be output to, skip this step. If not, ask the user:
"Where should I output your transformed dataset to? Either a local directory or S3 location works!"
If the user provides a directory (not a full file path), construct the output filename using the pattern {original_name}_{target_format}.jsonl (e.g., gen_qa_100k_openai.jsonl).
⏸ Wait for user.
For this step, you need: to generate a python function that transforms the dataset from the format in Step 5 to the format in Step 3
Read the reference guide at references/dataset_transformation_code.md and follow its skeleton exactly when generating the transformation function.
The python function should be in the form of:
def transform_dataset(df: pd.DataFrame) -> pd.DataFrame:
The <project-dir> is the project directory established by the directory-management skill (e.g., dpo-to-rlvr-conversion).
In notebook mode, add a %%writefile <project-dir>/scripts/transform_fn.py code cell AND write the file to disk for testing. In script mode, write the file to disk directly.
Continue iterating with the user's feedback — update the code in place on each revision rather than showing code inline.
If sample data was collected in Step 5, test the function against the sample records:
/tmp/test_input.jsonl), then run:
python3 -c "import sys; sys.path.insert(0, '<project-dir>/scripts'); from transform_fn import transform_dataset; import pandas as pd; df = pd.read_json('/tmp/test_input.jsonl', lines=True); result = transform_dataset(df); print(result.to_json(orient='records', lines=True))"If no sample data, present the function for review and refinement.
⏸ Wait for user.
If no project directory exists, activate the directory-management skill to set one up.
⏸ Wait for user.
Before writing the code, read:
references/code_output_guide.md (output format rules)code_templates/transformation.py (cell structure and skeleton code)The template uses # Cell N: Label markers — each marker starts a new section. Cell 2 (Transformation Function) is dynamically generated from Step 7; all other cells follow the template skeleton.
Generate the execution logic following the code output guide.
%%writefile <project-dir>/scripts/<script_name>.py code cell AND write the file to disk. In script mode, write the file to disk directly.transform_dataset from transform_fn.Read the reference guide at references/dataset_transformation_code.md and follow its execution script skeleton exactly.
If sample data was collected in Step 5, test the full pipeline:
/tmp/test_input.jsonl).python3 <project-dir>/scripts/<script_name> --input /tmp/test_input.jsonl --output /tmp/test_output.jsonlIf no sample data, present the notebook for review and refinement.
⏸ Wait for user.
Check the size of the input dataset:
head-object (S3 service) with the bucket and key to get ContentLength.Decision criteria:
Inform the user of the recommendation and get their approval:
If local:
"Your dataset is {size} MB — since it's under 50 MB, I'd recommend running the transformation locally. Would you like to proceed with local execution, or would you prefer a SageMaker Processing Job instead?"
If SageMaker Processing Job:
"Your dataset is {size} MB — since it's over 50 MB, I'd recommend running this as a SageMaker Processing Job for better performance. Would you like to proceed with a SageMaker Processing Job, or would you prefer to run it locally instead?"
Do not execute until the user approves. If the user rejects the recommendation, switch to the alternative and get their explicit approval before proceeding.
⏸ Wait for user.
After user confirms, add an execution cell to the notebook. Do NOT run the transformation directly (no bash, no inline python). If notebook execution tools (run_cell) are available, offer to run the cells. Otherwise, generate the cell for the user to execute themselves:
If local execution:
.py files already on disk (written by the agent during Steps 7 and 9): import transform_dataset from transform_fn, load the dataset, transform, and save output. Scripts are located in <project-dir>/scripts/.If SageMaker Processing Job:
processor.run(wait=True, logs=True) to block the cell and stream logs until the job completes. See scripts/transformation_tools.py for reference implementation details.Important: The agent must NOT execute the transformation directly via bash or inline python. If run_cell is available, use it to run the notebook cells. Otherwise, the cells are for the user to review and run. Only sample data (from Steps 7 and 9) should be transformed by the agent for validation purposes.
If
run_cellis available: "I've added the execution cell to the notebook. Would you like me to run it?" Otherwise: "I've added the execution cell to the notebook. You can run it to transform the full dataset. Would you like to review the notebook before running it?"
⏸ Wait for user.
For this step, you need: to verify the output looks correct and confirm with the user.
⏸ Wait for user to confirm.
development
Build workflows with AWS Step Functions state machines using the JSONata query language. Covers Amazon States Language (ASL) structure, state types, variables, data transformation, error handling, AWS service integration, and migrating from the JSONPath to the JSONata query language.
tools
Design, build, deploy, test, and debug serverless applications with AWS Lambda. Triggers on phrases like: Lambda function, event source, serverless application, API Gateway, EventBridge, Step Functions, serverless API, event-driven architecture, Lambda trigger. For deploying non-serverless apps to AWS, use deploy-on-aws plugin instead.
development
Validates the user's environment for SageMaker AI operations — checks SDK version, AWS region, and execution role. Use when the user says "set up", "getting started", "check my environment", "configure SDK", or as the first step in any plan involving SageMaker/Bedrock training, evaluation, or deployment.
data-ai
Selects a base model for the user's use case by querying SageMaker Hub. Use when the user asks which model to use, wants to select or change their base model, mentions a model name or family (e.g., "Llama", "Mistral", "Nova"), or wants to evaluate a base model — always activate even for known model names because the exact Hub model ID must be resolved. Queries available models, presents benchmarks and licenses, and confirms selection.