skills/recipes/evaluate-multimodal/SKILL.md
Evaluate multimodal AI agents that process images, audio, PDFs, or other files. Sets up evaluations using LangWatch's LLM-as-judge with image inputs, Scenario's multimodal testing, and document parsing evaluation patterns. Use when your agent handles non-text inputs.
npx skillsauth add langwatch/langwatch evaluate-multimodal
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
This recipe helps you evaluate agents that process images, audio, PDFs, or other non-text inputs.
Read the codebase to understand what your agent processes:
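As a starting point for that read-through, a small script can tally which file types appear among the repository's fixtures and sample data. This is only a sketch: the extension-to-modality map and the `count_modalities` helper are hypothetical names introduced here, not part of LangWatch or Scenario.

```python
from collections import Counter
from pathlib import Path

# Hypothetical extension-to-modality map; extend it for your own formats.
MODALITY_BY_EXTENSION = {
    ".png": "image", ".jpg": "image", ".jpeg": "image",
    ".gif": "image", ".webp": "image",
    ".wav": "audio", ".mp3": "audio", ".flac": "audio", ".ogg": "audio",
    ".pdf": "document",
}


def count_modalities(paths):
    """Count files per modality, ignoring extensions that are not in the map."""
    counts = Counter()
    for path in paths:
        modality = MODALITY_BY_EXTENSION.get(Path(path).suffix.lower())
        if modality is not None:
            counts[modality] += 1
    return dict(counts)


if __name__ == "__main__":
    # Walk the current directory and report how many files of each modality exist.
    repo_files = (p for p in Path(".").rglob("*") if p.is_file())
    for modality, count in sorted(count_modalities(repo_files).items()):
        print(f"{modality}: {count} file(s)")
```

File counts only hint at what the agent handles; the agent's input-handling code remains the authoritative source.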
Use the LangWatch MCP:
- fetch_scenario_docs → search for multimodal pages (image analysis, audio testing, file analysis)
- fetch_langwatch_docs → search for evaluation SDK docs

For PDF evaluation specifically, reference the pattern from python-sdk/examples/pdf_parsing_evaluation.ipynb:
LangWatch's LLM-as-judge evaluators can accept images. Create an evaluation that:
import langwatch

# image_dataset and my_agent are placeholders for your own dataset and agent.
experiment = langwatch.experiment.init("image-eval")

for idx, entry in experiment.loop(enumerate(image_dataset)):
    # Run the agent under test on one image.
    result = my_agent(image=entry["image_path"])

    # Ask the LLM judge whether the agent's output matches the image.
    experiment.evaluate(
        "llm_boolean",
        index=idx,
        data={
            "input": entry["image_path"],  # LLM-as-judge can view images
            "output": result,
        },
        settings={
            "model": "openai/gpt-5-mini",
            "prompt": "Does the agent correctly describe/classify this image?",
        },
    )
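The example above passes a file path as the judge input. The source does not say whether the judge expects a path, a URL, or inline image data, so check the LangWatch docs first. If inline data turns out to be needed, a base64 data URL is one common encoding. The sketch below uses only the standard library, and `image_to_data_url` is a hypothetical helper, not a LangWatch function.

```python
import base64
import mimetypes
from pathlib import Path


def image_to_data_url(image_path):
    """Return the image at image_path encoded as a base64 data URL."""
    path = Path(image_path)
    # Guess the MIME type from the file extension (e.g. ".png" -> "image/png").
    mime_type, _ = mimetypes.guess_type(path.name)
    if mime_type is None or not mime_type.startswith("image/"):
        raise ValueError(f"Not a recognized image type: {path.name}")
    encoded = base64.b64encode(path.read_bytes()).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"
```

Under that assumption, the `"input"` value would become `image_to_data_url(entry["image_path"])`.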
Use Scenario's audio testing patterns:
- fetch_scenario_docs with url for multimodal/audio-to-text.md

Follow the pattern from the PDF parsing evaluation example:
For agents that process arbitrary files:
- fetch_scenario_docs with url for multimodal/multimodal-files.md

For each modality, generate or collect test data that matches the agent's actual use case:
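One way to keep per-modality test data organized is a small manifest that records, for each case, the modality, the input file, and the expected answer. The sketch below is an illustration only: `EvalCase`, `group_by_modality`, and `missing_modalities` are hypothetical names, and the modality labels are whatever your agent actually handles.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalCase:
    modality: str    # e.g. "image", "audio", "pdf"
    input_path: str  # file handed to the agent
    expected: str    # reference answer the judge compares against


def group_by_modality(cases):
    """Group cases by modality, preserving the order in which they were given."""
    grouped = defaultdict(list)
    for case in cases:
        grouped[case.modality].append(case)
    return dict(grouped)


def missing_modalities(cases, required):
    """Return the required modalities that have no test case yet, sorted."""
    covered = {case.modality for case in cases}
    return sorted(set(required) - covered)
```

Checking `missing_modalities` before a run makes coverage gaps visible early, for example an agent that accepts audio but has no audio cases.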
Run the evaluation, review results, fix issues, re-run until quality is acceptable.
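To make "acceptable" concrete, the loop can be gated on a pass rate computed from the judge's boolean verdicts. This is a minimal sketch assuming you have collected one pass/fail value per entry yourself. The helper names and the 0.9 threshold are hypothetical, not LangWatch defaults.

```python
def pass_rate(results):
    """Return the fraction of truthy verdicts in a non-empty list."""
    if not results:
        raise ValueError("No evaluation results to score")
    return sum(1 for passed in results if passed) / len(results)


def failing_indices(results):
    """Return the dataset indices whose verdict was falsy, for manual review."""
    return [index for index, passed in enumerate(results) if not passed]


if __name__ == "__main__":
    verdicts = [True, True, False, True]  # illustrative values only
    threshold = 0.9                       # hypothetical quality bar
    rate = pass_rate(verdicts)
    print(f"pass rate: {rate:.2f}, review entries: {failing_indices(verdicts)}")
    print("acceptable" if rate >= threshold else "fix issues and re-run")
```

Reviewing the failing entries before changing the agent helps separate agent errors from bad test data or an overly strict judge prompt.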