Skills/deepeval/SKILL.md
Use when discussing or working with DeepEval (the Python AI evaluation framework)
DeepEval is a pytest-based framework for testing LLM applications. It provides 50+ evaluation metrics covering RAG pipelines, conversational AI, agents, safety, and custom criteria. DeepEval integrates into development workflows through pytest, supports multiple LLM providers, and includes component-level tracing with the @observe decorator.
Repository: https://github.com/confident-ai/deepeval
Documentation: https://deepeval.com
pip install -U deepeval
Requires Python 3.9+.
import pytest
from deepeval import assert_test
from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric
def test_chatbot():
    metric = AnswerRelevancyMetric(threshold=0.7, model="anthropic-claude-sonnet-4-5")
    test_case = LLMTestCase(
        input="What if these shoes don't fit?",
        actual_output="You have 30 days for full refund"
    )
    assert_test(test_case, [metric])
Run with: deepeval test run test_chatbot.py
DeepEval automatically loads .env.local then .env:
# .env
OPENAI_API_KEY="sk-..."
Evaluate both retrieval and generation phases:
from deepeval import evaluate
from deepeval.test_case import LLMTestCase
from deepeval.metrics import (
    ContextualPrecisionMetric,
    ContextualRecallMetric,
    ContextualRelevancyMetric,
    AnswerRelevancyMetric,
    FaithfulnessMetric
)
# Retrieval metrics
contextual_precision = ContextualPrecisionMetric(threshold=0.7)
contextual_recall = ContextualRecallMetric(threshold=0.7)
contextual_relevancy = ContextualRelevancyMetric(threshold=0.7)
# Generation metrics
answer_relevancy = AnswerRelevancyMetric(threshold=0.7)
faithfulness = FaithfulnessMetric(threshold=0.8)
test_case = LLMTestCase(
input="What are the side effects of aspirin?",
actual_output="Common side effects include stomach upset and nausea.",
expected_output="Aspirin side effects include gastrointestinal issues.",
retrieval_context=[
"Aspirin common side effects: stomach upset, nausea, vomiting.",
"Serious aspirin side effects: gastrointestinal bleeding.",
]
)
evaluate(test_cases=[test_case], metrics=[
contextual_precision, contextual_recall, contextual_relevancy,
answer_relevancy, faithfulness
])
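To make "relevant chunks ranked higher than irrelevant ones" concrete, here is a minimal, standard-library sketch of a rank-weighted precision score. It is an illustration only and not DeepEval's implementation: DeepEval's retrieval metrics rely on an LLM judge to decide relevance, whereas this sketch assumes the relevance labels are already known. The function name `rank_weighted_precision` is hypothetical.

```python
def rank_weighted_precision(relevance: list[bool]) -> float:
    """Average of precision@k taken at each rank k that holds a relevant chunk.

    `relevance[i]` says whether the chunk at rank i + 1 was judged relevant.
    Returns 0.0 when nothing is relevant.
    """
    hits = 0
    total = 0.0
    for k, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            total += hits / k  # precision@k, counted only at relevant ranks
    return total / hits if hits else 0.0


# Same two relevant chunks, different ordering.
good_order = rank_weighted_precision([True, True, False])
bad_order = rank_weighted_precision([False, True, True])
print(good_order, bad_order)
```

Both rankings retrieve the same two relevant chunks, but the one that places them first scores higher. A rank-aware retrieval metric is designed to expose that difference.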
Component-level tracing:
from deepeval.tracing import observe, update_current_span
@observe(metrics=[contextual_relevancy])
def retriever(query: str):
    chunks = your_vector_db.search(query)
    update_current_span(
        test_case=LLMTestCase(input=query, retrieval_context=chunks)
    )
    return chunks

@observe(metrics=[answer_relevancy, faithfulness])
def generator(query: str, chunks: list):
    response = your_llm.generate(query, chunks)
    update_current_span(
        test_case=LLMTestCase(
            input=query,
            actual_output=response,
            retrieval_context=chunks
        )
    )
    return response

@observe
def rag_pipeline(query: str):
    chunks = retriever(query)
    return generator(query, chunks)
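As a rough mental model of component-level tracing, the sketch below uses a plain decorator to record one span per decorated call, together with the name of its parent. This is not how `@observe` is implemented. `record_span`, `spans`, and the toy functions are hypothetical stand-ins that run offline with the standard library only.

```python
import functools

spans = []   # completed spans, in the order the calls finish
_stack = []  # names of decorated functions currently executing


def record_span(func):
    """Hypothetical tracing decorator: logs each call's name and parent."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        parent = _stack[-1] if _stack else None
        _stack.append(func.__name__)
        try:
            return func(*args, **kwargs)
        finally:
            _stack.pop()
            spans.append({"name": func.__name__, "parent": parent})
    return wrapper


@record_span
def retriever(query):
    return [f"chunk about {query}"]


@record_span
def generator(query, chunks):
    return f"answer using {len(chunks)} chunk(s)"


@record_span
def rag_pipeline(query):
    return generator(query, retriever(query))


result = rag_pipeline("aspirin")
for span in spans:
    print(span)
```

Each component ends up with its own span nested under the pipeline span, which is what allows different metrics to be attached to the retriever and the generator separately.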
Test multi-turn dialogues:
from deepeval import evaluate
from deepeval.test_case import Turn, ConversationalTestCase
from deepeval.metrics import (
RoleAdherenceMetric,
KnowledgeRetentionMetric,
ConversationCompletenessMetric,
TurnRelevancyMetric
)
convo_test_case = ConversationalTestCase(
chatbot_role="professional, empathetic medical assistant",
turns=[
Turn(role="user", content="I have a persistent cough"),
Turn(role="assistant", content="How long have you had this cough?"),
Turn(role="user", content="About a week now"),
Turn(role="assistant", content="A week-long cough should be evaluated.")
]
)
metrics = [
RoleAdherenceMetric(threshold=0.7),
KnowledgeRetentionMetric(threshold=0.7),
ConversationCompletenessMetric(threshold=0.6),
TurnRelevancyMetric(threshold=0.7)
]
evaluate(test_cases=[convo_test_case], metrics=metrics)
Test tool usage and task completion:
from deepeval import evaluate
from deepeval.test_case import ConversationalTestCase, ToolCall, Turn
from deepeval.metrics import (
TaskCompletionMetric,
ToolUseMetric,
ArgumentCorrectnessMetric
)
agent_test_case = ConversationalTestCase(
turns=[
Turn(role="user", content="When did Trump first raise tariffs?"),
Turn(
role="assistant",
content="Let me search for that information.",
tools_called=[
ToolCall(
name="WebSearch",
arguments={"query": "Trump first raised tariffs year"}
)
]
),
Turn(role="assistant", content="Trump first raised tariffs in 2018.")
]
)
evaluate(
test_cases=[agent_test_case],
metrics=[
TaskCompletionMetric(threshold=0.7),
ToolUseMetric(threshold=0.7),
ArgumentCorrectnessMetric(threshold=0.7)
]
)
Check for harmful content:
from deepeval.metrics import (
ToxicityMetric,
BiasMetric,
PIILeakageMetric,
HallucinationMetric
)
def safety_gate(output: str, input: str) -> tuple[bool, list]:
    """Returns (passed, reasons) tuple"""
    test_case = LLMTestCase(input=input, actual_output=output)
    safety_metrics = [
        ToxicityMetric(threshold=0.5),
        BiasMetric(threshold=0.5),
        PIILeakageMetric(threshold=0.5)
    ]
    failures = []
    for metric in safety_metrics:
        metric.measure(test_case)
        if not metric.is_successful():
            # Use the class name as a label; it is always available
            failures.append(f"{type(metric).__name__}: {metric.reason}")
    return len(failures) == 0, failures
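The control flow of the gate can be exercised without any API calls by swapping in offline stand-ins for the metrics. The sketch below assumes only the `measure()` / `is_successful()` / `reason` shape used above. `StubMetric` and `gate` are hypothetical names, and the banned-word check is a toy rule rather than a real safety check.

```python
from dataclasses import dataclass


@dataclass
class StubMetric:
    """Hypothetical offline metric: fails when a banned word appears."""
    label: str
    banned: str
    reason: str = ""
    passed_last: bool = True

    def measure(self, text: str) -> None:
        self.passed_last = self.banned not in text.lower()
        self.reason = "clean" if self.passed_last else f"found '{self.banned}'"

    def is_successful(self) -> bool:
        return self.passed_last


def gate(text: str, metrics: list) -> tuple[bool, list]:
    """Run every metric and collect one reason per failure."""
    failures = []
    for metric in metrics:
        metric.measure(text)
        if not metric.is_successful():
            failures.append(f"{metric.label}: {metric.reason}")
    return len(failures) == 0, failures


checks = [StubMetric("Toxicity", "idiot"), StubMetric("PII", "ssn")]
passed, reasons = gate("Your SSN is on file, you idiot", checks)
print(passed, reasons)
```

Collecting every failure instead of stopping at the first one gives the caller a complete list of reasons to log or show.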
Retrieval Phase:
ContextualPrecisionMetric - Relevant chunks ranked higher than irrelevant ones
ContextualRecallMetric - All necessary information retrieved
ContextualRelevancyMetric - Retrieved chunks relevant to input
Generation Phase:
AnswerRelevancyMetric - Output addresses the input query
FaithfulnessMetric - Output grounded in retrieval context
Conversational:
TurnRelevancyMetric - Each turn relevant to conversation
KnowledgeRetentionMetric - Information retained across turns
ConversationCompletenessMetric - All aspects addressed
RoleAdherenceMetric - Chatbot maintains assigned role
TopicAdherenceMetric - Conversation stays on topic
Agentic:
TaskCompletionMetric - Task successfully completed
ToolUseMetric - Correct tools selected
ArgumentCorrectnessMetric - Tool arguments correct
MCPUseMetric - MCP correctly used
Safety:
ToxicityMetric - Harmful content detection
BiasMetric - Biased outputs identification
HallucinationMetric - Fabricated information
PIILeakageMetric - Personal information leakage
G-Eval (LLM-based):
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams
custom_metric = GEval(
name="Professional Tone",
criteria="Determine if response maintains professional, empathetic tone",
evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],
threshold=0.7,
model="anthropic-claude-sonnet-4-5"
)
BaseMetric subclass:
See references/custom_metrics.md for complete guide on creating custom metrics with BaseMetric subclassing and deterministic scorers (ROUGE, BLEU, BERTScore).
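For a sense of what a deterministic scorer computes, here is a small ROUGE-1-style unigram-overlap F1 using only the standard library. It is a simplified sketch (lowercased whitespace tokens, no stemming), not the scorer shipped with DeepEval or described in the reference guide. The name `unigram_f1` is hypothetical.

```python
from collections import Counter


def unigram_f1(candidate: str, reference: str) -> float:
    """ROUGE-1-style F1 over lowercase whitespace tokens."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped count of shared tokens
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


score = unigram_f1("you have 30 days for a refund", "we offer a 30 day refund")
print(round(score, 3))
```

Because the score depends only on the two strings, it is reproducible and free to run, which makes it a reasonable candidate for wrapping in a custom metric alongside LLM-judged ones.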
DeepEval supports OpenAI, Anthropic Claude, Google Gemini, AWS Bedrock, and 100+ providers via LiteLLM. Anthropic models are preferred.
CLI configuration (global):
deepeval set-azure-openai --openai-endpoint=... --openai-api-key=... --deployment-name=...
deepeval set-ollama deepseek-r1:1.5b
Python configuration (per-metric):
from deepeval.models import AnthropicModel, OllamaModel
anthropic_model = AnthropicModel(
model_id=settings.anthropic_model_id,
client_args={"api_key": settings.anthropic_api_key},
temperature=settings.agent_temperature
)
metric = AnswerRelevancyMetric(model=anthropic_model)
See references/model_providers.md for complete provider configuration guide.
Async mode is enabled by default. Configure with AsyncConfig and CacheConfig:
from deepeval import evaluate
from deepeval.evaluate import AsyncConfig, CacheConfig
evaluate(
test_cases=[...],
metrics=[...],
async_config=AsyncConfig(
run_async=True,
max_concurrent=20, # Reduce if rate limited
throttle_value=0 # Delay between test cases (seconds)
),
cache_config=CacheConfig(
use_cache=True, # Read from cache
write_cache=True # Write to cache
)
)
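The effect of a concurrency cap and a throttle delay can be illustrated with plain `asyncio`. The sketch below shows the general pattern of a semaphore limiting in-flight calls and says nothing about DeepEval's internals. `run_bounded` and `fake_metric_call` are hypothetical names, and the sleep stands in for a slow LLM request.

```python
import asyncio


async def run_bounded(jobs, max_concurrent: int, throttle: float = 0.0):
    """Run coroutine factories with at most `max_concurrent` in flight."""
    gate = asyncio.Semaphore(max_concurrent)
    in_flight = 0
    peak = 0

    async def guarded(job):
        nonlocal in_flight, peak
        async with gate:
            in_flight += 1
            peak = max(peak, in_flight)
            try:
                return await job()
            finally:
                in_flight -= 1

    tasks = []
    for job in jobs:
        tasks.append(asyncio.create_task(guarded(job)))
        if throttle:
            await asyncio.sleep(throttle)  # stagger task start times
    results = await asyncio.gather(*tasks)
    return results, peak


async def fake_metric_call(i: int) -> int:
    await asyncio.sleep(0.01)  # stands in for a slow LLM request
    return i * 2


jobs = [lambda i=i: fake_metric_call(i) for i in range(10)]
results, peak = asyncio.run(run_bounded(jobs, max_concurrent=3))
print(results, peak)
```

Lowering the cap trades throughput for fewer simultaneous requests, which is the usual lever when a provider starts returning rate-limit errors.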
CLI parallelisation:
deepeval test run -n 4 -c -i # 4 processes, cached, ignore errors
Best practices:
Reduce max_concurrent to 5 if hitting rate limits
Prefer the evaluate() function over individual measure() calls
See references/async_performance.md for a detailed performance optimisation guide.
from deepeval.dataset import EvaluationDataset, Golden
dataset = EvaluationDataset()
# From CSV
dataset.add_goldens_from_csv_file(
file_path="./test_data.csv",
input_col_name="question",
expected_output_col_name="answer",
context_col_name="context",
context_col_delimiter="|"
)
# From JSON
dataset.add_goldens_from_json_file(
file_path="./test_data.json",
input_key_name="query",
expected_output_key_name="response"
)
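The CSV layout implied by the column arguments above (one column each for input, expected output, and a delimited context cell) can be sketched with the standard library. This only illustrates the file shape and the `|` splitting; it is not DeepEval's loader. The sample rows and the `read_rows` helper are made up for illustration.

```python
import csv
import io

# Hypothetical file contents matching the column names used above.
CSV_TEXT = """question,answer,context
What is the refund window?,30 days,Refunds within 30 days|Receipt required
Do you ship abroad?,Yes,
"""


def read_rows(text: str, context_delimiter: str = "|") -> list[dict]:
    """Parse rows into dicts, splitting the context cell into a list."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        raw = row["context"] or ""
        rows.append({
            "input": row["question"],
            "expected_output": row["answer"],
            "context": [c for c in raw.split(context_delimiter) if c],
        })
    return rows


rows = read_rows(CSV_TEXT)
for row in rows:
    print(row)
```

An empty context cell becomes an empty list here, so downstream code does not have to handle a stray empty string.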
from deepeval.synthesizer import Synthesizer
synthesizer = Synthesizer()
# From documents
goldens = synthesizer.generate_goldens_from_docs(
document_paths=["./docs/knowledge_base.pdf"],
max_goldens_per_document=10,
evolution_types=["REASONING", "MULTICONTEXT", "COMPARATIVE"]
)
# From scratch
goldens = synthesizer.generate_goldens_from_scratch(
subject="customer support for SaaS product",
task="answer user questions about billing",
max_goldens=20
)
Evolution types: REASONING, MULTICONTEXT, CONCRETIZING, CONSTRAINED, COMPARATIVE, HYPOTHETICAL, IN_BREADTH
See references/dataset_management.md for complete dataset guide including versioning and cloud integration.
from deepeval.test_case import LLMTestCase, ToolCall
test_case = LLMTestCase(
input="What if these shoes don't fit?",
actual_output="You have 30 days for full refund",
expected_output="We offer 30-day full refund",
retrieval_context=["All customers eligible for 30 day refund"],
tools_called=[ToolCall(name="...", arguments={"...": "..."})]
)
from deepeval.test_case import Turn, ConversationalTestCase
convo_test_case = ConversationalTestCase(
chatbot_role="helpful customer service agent",
turns=[
Turn(role="user", content="I need help with my order"),
Turn(role="assistant", content="I'd be happy to help"),
Turn(role="user", content="It hasn't arrived yet")
]
)
from deepeval.test_case import MLLMTestCase, MLLMImage
m_test_case = MLLMTestCase(
input=["Describe this image", MLLMImage(url="./photo.png", local=True)],
actual_output=["A red bicycle leaning against a wall"]
)
# .github/workflows/test.yml
name: LLM Tests
on: [push, pull_request]
jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v5
      - name: Install dependencies
        run: pip install deepeval
      - name: Run evaluations
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: deepeval test run tests/
Detailed implementation guides:
references/model_providers.md - Complete guide for configuring OpenAI, Anthropic, Gemini, Bedrock, and local models. Includes provider-specific considerations, cost analysis, and troubleshooting.
references/custom_metrics.md - Complete guide for creating custom metrics by subclassing BaseMetric. Includes deterministic scorers (ROUGE, BLEU, BERTScore) and LLM-based evaluation patterns.
references/async_performance.md - Complete guide for optimising evaluation performance with async mode, caching, concurrency tuning, and rate limit handling.
references/dataset_management.md - Complete guide for dataset loading, saving, synthetic generation, versioning, and cloud integration with Confident AI.
retrieval_context for RAG, expected_output for G-Eval)@observe for individual partsdeepeval test runAvoid:
Do: