langwatch/evaluate-multimodal

Name: evaluate-multimodal
Author: langwatch

skills/recipes/evaluate-multimodal/SKILL.md

npx skillsauth add langwatch/langwatch evaluate-multimodal

Evaluate Your Multimodal Agent

This recipe helps you evaluate agents that process images, audio, PDFs, or other non-text inputs.

Step 1: Identify Modalities

Read the codebase to understand what your agent processes:

Images: classification, analysis, generation, OCR
Audio: transcription, voice agents, audio Q&A
PDFs/Documents: parsing, extraction, summarization
Mixed: multiple input types in one pipeline
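As a rough aid for this inventory step, the sketch below groups sample files by modality using their extensions. The extension map and both function names are hypothetical and not part of LangWatch; adjust the map to the formats your agent actually accepts.

```python
from collections import Counter
from pathlib import Path

# Hypothetical extension-to-modality map; extend it for your agent's formats.
MODALITY_BY_EXTENSION = {
    ".png": "image", ".jpg": "image", ".jpeg": "image", ".webp": "image",
    ".wav": "audio", ".mp3": "audio", ".flac": "audio",
    ".pdf": "document", ".docx": "document", ".csv": "document",
}


def classify_modality(path: str) -> str:
    """Return the modality for a file path, or "unknown" if the extension is not mapped."""
    return MODALITY_BY_EXTENSION.get(Path(path).suffix.lower(), "unknown")


def summarize_modalities(paths: list[str]) -> dict[str, int]:
    """Count how many files fall into each modality."""
    return dict(Counter(classify_modality(p) for p in paths))
```

A summary with more than one key suggests a mixed pipeline, in which case each modality likely needs its own evaluation.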

Step 2: Read the Relevant Docs

Use the LangWatch MCP:

fetch_scenario_docs → search for multimodal pages (image analysis, audio testing, file analysis)
fetch_langwatch_docs → search for evaluation SDK docs

For PDF evaluation specifically, reference the pattern from python-sdk/examples/pdf_parsing_evaluation.ipynb:

Download/load documents
Define extraction pipeline
Use LangWatch experiment SDK to evaluate extraction accuracy

Step 3: Set Up Evaluation by Modality

Image Evaluation

LangWatch's LLM-as-judge evaluators can accept images. Create an evaluation that:

Loads test images
Runs the agent on each image
Uses an LLM-as-judge evaluator to assess output quality

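import langwatch

# Assumed to exist in your project: image_dataset (an iterable of dicts with an
# "image_path" key) and my_agent (the callable under test).
experiment = langwatch.experiment.init("image-eval")

for idx, entry in experiment.loop(enumerate(image_dataset)):
    # Run the agent on one test image
    result = my_agent(image=entry["image_path"])
    # Ask the LLM judge whether the agent's output matches the image
    experiment.evaluate(
        "llm_boolean",
        index=idx,
        data={
            "input": entry["image_path"],  # LLM-as-judge can view images
            "output": result,
        },
        settings={
            "model": "openai/gpt-5-mini",
            "prompt": "Does the agent correctly describe/classify this image?",
        },
    )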

Audio Evaluation

Use Scenario's audio testing patterns:

Audio-to-text: verify transcription accuracy
Audio-to-audio: verify voice agent responses
Call fetch_scenario_docs with the url for multimodal/audio-to-text.md
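The recipe does not say how to measure transcription accuracy. One common choice is word error rate (WER), sketched below under that assumption. The function is a plain-Python illustration and is not part of the Scenario or LangWatch APIs.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # previous[j] is the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)
```

A WER of 0.0 means the transcript matches the reference word for word. Values can exceed 1.0 when the hypothesis contains many extra words.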

PDF/Document Evaluation

Follow the pattern from the PDF parsing evaluation example:

Load documents (PDFs, CSVs, etc.)
Define extraction/parsing pipeline
Evaluate extraction accuracy against expected fields
Use structured evaluation (exact match for fields, LLM judge for summaries)
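The exact-match half of the structured evaluation above could look like the following sketch. The field names in the usage note are made up, and trimming whitespace and ignoring case before comparison is an assumption rather than something the PDF example prescribes.

```python
def field_matches(expected: dict[str, str], extracted: dict[str, str]) -> dict[str, bool]:
    """Compare each expected field to the extracted value after light normalization."""

    def normalize(value: object) -> str:
        # Assumption: surrounding whitespace and letter case are not significant.
        return str(value).strip().lower()

    return {
        field: field in extracted and normalize(extracted[field]) == normalize(want)
        for field, want in expected.items()
    }


def document_score(expected: dict[str, str], extracted: dict[str, str]) -> float:
    """Fraction of expected fields that were extracted correctly."""
    if not expected:
        raise ValueError("expected must define at least one field")
    matches = field_matches(expected, extracted)
    return sum(matches.values()) / len(matches)
```

For example, with expected fields invoice_number, total, and currency, an extraction that only gets invoice_number right scores one third. Free-text outputs such as summaries are better left to the LLM judge.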

File Analysis

For agents that process arbitrary files:

Use Scenario's file analysis patterns
Call fetch_scenario_docs with the url for multimodal/multimodal-files.md

Step 4: Generate Domain-Specific Test Data

For each modality, generate or collect test data that matches the agent's actual use case:

If it's a medical imaging agent → use relevant medical image samples
If it's a document parser → use real document types the agent encounters
If it's a voice assistant → record realistic voice prompts
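To check that the collected test data covers every modality the agent handles, a small manifest plus a coverage check can help. The sketch below is hypothetical: the EvalCase fields and the default minimum of three cases per modality are illustrative assumptions, not LangWatch requirements.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalCase:
    """One test item: which modality it exercises, where the file lives, what to expect."""
    modality: str
    path: str
    expected: str


def undercovered_modalities(
    cases: list[EvalCase], required: list[str], min_cases: int = 3
) -> dict[str, int]:
    """Return required modalities that have fewer than min_cases cases, with their counts."""
    counts = Counter(case.modality for case in cases)
    return {modality: counts[modality] for modality in required if counts[modality] < min_cases}
```

An empty result means every required modality has at least the minimum number of cases.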

Step 5: Run and Iterate

Run the evaluation, review the results, fix issues, and re-run until quality is acceptable.
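One way to make "acceptable" concrete is to track the pass rate and watch for regressions between runs. The sketch below assumes you have collected each run's verdicts as a dict from case id to pass/fail. The helper names are hypothetical and independent of the LangWatch SDK.

```python
def summarize_run(results: dict[str, bool]) -> tuple[float, list[str]]:
    """Return the pass rate and the sorted ids of failing cases."""
    if not results:
        raise ValueError("results must contain at least one case")
    failures = sorted(case_id for case_id, passed in results.items() if not passed)
    pass_rate = (len(results) - len(failures)) / len(results)
    return pass_rate, failures


def regressions(previous: dict[str, bool], current: dict[str, bool]) -> list[str]:
    """Cases that passed in the previous run but fail in the current one."""
    return sorted(
        case_id
        for case_id, passed in current.items()
        if not passed and previous.get(case_id, False)
    )
```

Comparing each re-run against the previous one shows whether a fix for one case broke another.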

Common Mistakes

Do NOT evaluate multimodal agents with text-only metrics; use image-aware judges
Do NOT skip testing with real file formats; synthetic descriptions of a file are not enough
Do NOT forget to handle file loading errors in evaluations
Do NOT use generic test images; use domain-specific ones matching the agent's purpose
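For the file loading point, one option is to validate every test file before the evaluation loop starts, so a missing or empty file is reported as a data problem and not scored as an agent failure. This is a minimal sketch. The function name and the "image_path" default key are assumptions chosen to match the image example.

```python
from pathlib import Path


def split_loadable(entries: list[dict], path_key: str = "image_path"):
    """Separate entries whose file can be read from those that cannot."""
    loadable, failed = [], []
    for entry in entries:
        path = Path(entry[path_key])
        try:
            if path.stat().st_size == 0:
                raise ValueError("file is empty")
            with path.open("rb") as handle:
                handle.read(1)  # confirm the file is actually readable
        except (OSError, ValueError) as error:
            # Keep the reason so it can be reported alongside the results.
            failed.append((entry, f"{type(error).__name__}: {error}"))
        else:
            loadable.append(entry)
    return loadable, failed
```

Run the evaluation on the loadable entries only, and review the failed list separately.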

Evaluate multimodal AI agents that process images, audio, PDFs, or other files. Sets up evaluations using LangWatch's LLM-as-judge with image inputs, Scenario's multimodal testing, and document parsing evaluation patterns. Use when your agent handles non-text inputs.

3,203 stars

testing

Updated Apr 15, 2026

npx skillsauth add langwatch/langwatch evaluate-multimodal

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

Clean (95%): Trivy, container and dependency vulnerability scanner
Clean (95%): Semgrep, static code analysis for vulnerabilities
Clean (95%): mcp-scan (Snyk), Model Context Protocol security validation
Skipped (50%): Snyk (dep), open source security scanning
Skipped (50%): Socket.dev, supply chain security analysis
Skipped (50%): VirusTotal, multi-engine malware detection
Skipped (50%): CrowdStrike, advanced threat intelligence
Skipped (50%): OSV-Scanner, Open Source Vulnerability database check
Skipped (50%): OWASP Dep-Check

Last scanned: Apr 24, 2026, 8:47 PM · 1.8s · 1 file scanned

SKILL.md

name: evaluate-multimodal
description: Evaluate multimodal AI agents that process images, audio, PDFs, or other files. Sets up evaluations using LangWatch's LLM-as-judge with image inputs, Scenario's multimodal testing, and document parsing evaluation patterns. Use when your agent handles non-text inputs.
license: MIT
compatibility: Requires LangWatch SDK and optionally @langwatch/scenario. Works with Claude Code and similar coding agents.
category: recipe

Related Skills

langwatch/tracing

development

Add LangWatch tracing and observability to your code. Use for both onboarding (instrument an entire codebase) and targeted operations (add tracing to a specific function or module). Supports Python and TypeScript with all major frameworks.

3,203 stars · SKILL.md · Updated Apr 15, 2026

langwatch/scenarios

tools

Test your AI agent with simulation-based scenarios. Covers writing scenario test code (Scenario SDK), creating platform scenarios (CLI or MCP), and red teaming for security vulnerabilities. Auto-detects whether to use code or platform approach based on context.

3,203 stars · SKILL.md · Updated Apr 15, 2026

langwatch/test-compliance

testing

Test that your AI agent stays observational and doesn't give prescriptive advice in regulated domains (healthcare, finance, legal). Creates scenario tests for boundary enforcement and red team tests for adversarial probing. Use when your agent advises but must not prescribe.

3,203 stars · SKILL.md · Updated Apr 15, 2026

langwatch/test-cli-usability

tools

Write scenario tests that verify your CLI tool is usable by AI agents. Ensures commands work non-interactively, provide clear output, and don't hang on prompts. Use when you want to prove your CLI is agent-friendly.

3,203 stars · SKILL.md · Updated Apr 15, 2026

Install manually

# Clone the repo
git clone https://github.com/langwatch/langwatch.git

# Create the global Claude Code skills folder if it does not exist yet
mkdir -p ~/.claude/skills

# Copy the skill into it
cp -r langwatch/skills/recipes/evaluate-multimodal ~/.claude/skills/

See the official Claude Code Skills docs for the skills path.

Repository

langwatch/langwatch

3,203 stars

Compatible with

Claude Code

OpenAI Codex CLI

ChatGPT