Neuron AI Evaluation Engineer

This skill helps you create and run evaluations for AI systems in Neuron AI. The evaluation system provides dataset-driven testing with flexible assertions, comprehensive result reporting, and extensible output drivers.

Core Concepts

The Evaluation System

Evaluations test AI systems using three main components:

Evaluators - Test classes that define what to run and how to validate
Datasets - Test data sources (arrays, JSON files)
Assertions - Validation rules for checking outputs

Dataset Items → Evaluator::run() → Output → Evaluator::evaluate() → Assertions → Results

Evaluation Flow

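The runner works through these steps:

setUp() - Initialize resources (once per evaluator, before the first item)
run(datasetItem) - Execute your AI logic (once per dataset item)
evaluate(output, datasetItem) - Assert against expected results (once per dataset item)
Repeat run() and evaluate() for the next item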

Note: Each evaluation starts with a fresh assertion executor - no manual reset needed.
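The flow above can be sketched in plain PHP. This is a minimal illustration only: runEvaluation() and its callables are hypothetical stand-ins, not the library's EvaluatorRunner, and the "AI logic" is a simple echo so the snippet runs without an API key.

```php
<?php
// Hypothetical stand-in for the runner loop; not NeuronAI's EvaluatorRunner.
function runEvaluation(array $dataset, callable $setUp, callable $run, callable $evaluate): array
{
    $setUp(); // called once per evaluator, before the first item

    $results = [];
    foreach ($dataset as $index => $item) {
        $output = $run($item);                        // once per dataset item
        $results[$index] = $evaluate($output, $item); // true means the item passed
    }

    return $results;
}

$setUpCalls = 0;

$results = runEvaluation(
    [
        ['text' => 'I love this product!', 'content' => 'product'],
        ['text' => 'This is terrible.', 'content' => 'positive'],
    ],
    function () use (&$setUpCalls): void {
        $setUpCalls++;
    },
    // Stand-in for the agent call: echoes the input text back.
    fn (array $item): string => 'Echo: ' . $item['text'],
    // Stand-in for an assertion: substring check against the expected content.
    fn (string $output, array $item): bool => str_contains($output, $item['content'])
);

var_export($results);
```

The first item passes and the second fails, while setUp runs only once regardless of dataset size.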

Creating Custom Evaluators

Basic Evaluator

use NeuronAI\Evaluation\BaseEvaluator;
use NeuronAI\Evaluation\Contracts\DatasetInterface;
use NeuronAI\Evaluation\Assertions\StringContains;
use NeuronAI\Evaluation\Dataset\ArrayDataset;
use NeuronAI\Agent;
use NeuronAI\Agent\SystemPrompt;
use NeuronAI\Chat\Messages\UserMessage;

class ContainsEvaluator extends BaseEvaluator
{
    public function getDataset(): DatasetInterface
    {
        return new ArrayDataset([
            [
                'text' => 'I love this product!',
                'content' => 'product',
            ],
            [
                'text' => 'This is terrible.',
                'content' => 'positive',
            ],
        ]);
    }

    public function run(array $datasetItem): mixed
    {
        $response = MyAgent::make()->chat(
            new UserMessage($datasetItem['text'])
        )->getMessage();

        return $response->getContent();
    }

    public function evaluate(mixed $output, array $datasetItem): void
    {
        $this->assert(
            new StringContains($datasetItem['content']),
            $output
        );
    }
}

JSON Dataset

For larger datasets, use JSON files:

use NeuronAI\Evaluation\Dataset\JsonDataset;

public function getDataset(): DatasetInterface
{
    return new JsonDataset(__DIR__ . '/datasets/sentiment.json');
}

JSON format (sentiment.json):

[
    {"text": "I love this!", "expected": "positive"},
    {"text": "This is bad.", "expected": "negative"}
]
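Because evaluate() reads keys such as 'expected' from each item, it can help to check a dataset file before running an evaluator against it. The sketch below assumes nothing about how JsonDataset validates its input; loadDatasetItems() is a hypothetical helper that decodes the JSON and verifies the required keys.

```php
<?php
/**
 * Decode a JSON dataset and check that every item has the required keys.
 * Hypothetical helper for illustration; JsonDataset may behave differently.
 */
function loadDatasetItems(string $json, array $requiredKeys): array
{
    // JSON_THROW_ON_ERROR turns malformed JSON into a JsonException.
    $items = json_decode($json, true, 512, JSON_THROW_ON_ERROR);

    if (!is_array($items) || !array_is_list($items)) {
        throw new InvalidArgumentException('Dataset must be a JSON array of objects.');
    }

    foreach ($items as $index => $item) {
        if (!is_array($item)) {
            throw new InvalidArgumentException("Item {$index} is not an object.");
        }

        $missing = array_diff($requiredKeys, array_keys($item));
        if ($missing !== []) {
            throw new InvalidArgumentException(
                "Item {$index} is missing: " . implode(', ', $missing)
            );
        }
    }

    return $items;
}

$json = '[
    {"text": "I love this!", "expected": "positive"},
    {"text": "This is bad.", "expected": "negative"}
]';

$items = loadDatasetItems($json, ['text', 'expected']);
echo count($items), " items loaded\n";
```

In a real project the JSON string would come from file_get_contents() on the dataset path.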

Built-in Assertions

String Assertions

StringContains

Check if the output contains a substring:

$this->assert(new StringContains('positive'), $output);

StringContainsAll

Check if the output contains all keywords:

$this->assert(new StringContainsAll(['hello', 'world']), $output);

StringContainsAny

Check if the output contains any of the keywords:

$this->assert(new StringContainsAny(['success', 'completed']), $output);

StringStartsWith

Check if the output starts with a prefix:

$this->assert(new StringStartsWith('Hello'), $output);

StringEndsWith

Check if the output ends with a suffix:

$this->assert(new StringEndsWith('!'), $output);

StringLengthBetween

Check if the string length is within range:

$this->assert(new StringLengthBetween(10, 100), $output);

StringDistance

Check string similarity using Levenshtein distance:

$this->assert(new StringDistance(
    reference: 'expected text',
    threshold: 0.5,      // Minimum similarity score
    maxDistance: 50          // Maximum allowed edits
), $output);
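To build intuition for the two parameters, here is a minimal sketch of a Levenshtein-based check. It assumes the similarity score is the edit distance normalized by the longer string's length and that both limits must hold; the library's actual formula may differ. The function names are hypothetical.

```php
<?php
// Assumed normalization: 1.0 means identical, 0.0 means entirely different.
function levenshteinSimilarity(string $reference, string $actual): float
{
    $maxLength = max(strlen($reference), strlen($actual));
    if ($maxLength === 0) {
        return 1.0; // two empty strings are identical
    }

    // levenshtein() counts single-byte insertions, deletions, and substitutions.
    return 1.0 - levenshtein($reference, $actual) / $maxLength;
}

// Assumed combination: both the edit limit and the similarity threshold must hold.
function passesDistanceCheck(
    string $reference,
    string $actual,
    float $threshold,
    int $maxDistance
): bool {
    return levenshtein($reference, $actual) <= $maxDistance
        && levenshteinSimilarity($reference, $actual) >= $threshold;
}

// "kitten" -> "sitting" needs 3 edits; the longer string has 7 characters.
printf("%.3f\n", levenshteinSimilarity('kitten', 'sitting'));
var_dump(passesDistanceCheck('kitten', 'sitting', 0.5, 50));
```

Note that levenshtein() works on bytes, so multi-byte UTF-8 text will count more edits than a character-based comparison would.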

StringSimilarity

Check string similarity using embeddings:

use NeuronAI\Evaluation\Assertions\StringSimilarity;
use NeuronAI\RAG\Embeddings\OpenAI\OpenAIEmbeddings;

$this->assert(new StringSimilarity(
    reference: 'The quick brown fox',
    embeddingsProvider: new OpenAIEmbeddings(key: 'YOUR_KEY'),
    threshold: 0.6
), $output);
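Embedding-based comparison usually comes down to a vector similarity measure. The sketch below shows cosine similarity, a common choice; whether StringSimilarity uses exactly this measure is an assumption, and the vectors here are toy values rather than real embeddings.

```php
<?php
// Cosine similarity between two equal-length numeric vectors.
function cosineSimilarity(array $a, array $b): float
{
    $a = array_values($a);
    $b = array_values($b);

    if ($a === [] || count($a) !== count($b)) {
        throw new InvalidArgumentException('Vectors must be non-empty and of equal length.');
    }

    $dot = 0.0;
    $normA = 0.0;
    $normB = 0.0;
    foreach ($a as $i => $value) {
        $dot += $value * $b[$i];
        $normA += $value ** 2;
        $normB += $b[$i] ** 2;
    }

    if ($normA == 0.0 || $normB == 0.0) {
        return 0.0; // a zero vector has no direction to compare
    }

    return $dot / (sqrt($normA) * sqrt($normB));
}

// Toy vectors standing in for embeddings of the output and the reference.
$score = cosineSimilarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]);
var_dump($score >= 0.6); // compare against the assertion threshold
```

Vectors pointing in the same direction score close to 1.0 and orthogonal vectors score 0.0, which is why a threshold such as 0.6 expresses "reasonably similar in meaning".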

Pattern Assertions

MatchesRegex

Match the output against a regular expression:

$this->assert(new MatchesRegex('/^\d{3}-\d{2}-\d{4}$/'), $output);

Structure Assertions

IsValidJson

Check if the output is valid JSON:

$this->assert(new IsValidJson(), $output);
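A validity check of this kind can be sketched with PHP's own JSON functions. The isValidJson() function below is a hypothetical illustration, not the library's implementation; it shows why model output wrapped in a markdown code fence would not count as valid JSON.

```php
<?php
// Returns true only for strings that json_decode() can parse without error.
function isValidJson(mixed $value): bool
{
    if (!is_string($value)) {
        return false;
    }

    try {
        json_decode($value, flags: JSON_THROW_ON_ERROR);
        return true;
    } catch (JsonException) {
        return false;
    }
}

var_dump(isValidJson('{"sentiment": "positive"}'));
var_dump(isValidJson("```json\n{\"sentiment\": \"positive\"}\n```"));
```

If an agent tends to wrap its JSON in fences or prose, strip that in run() before asserting, or tighten the agent's instructions.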

AI Judge Assertions

AgentJudge

Use an AI agent to evaluate outputs with custom criteria:

use NeuronAI\Evaluation\Assertions\AgentJudge;
use NeuronAI\Agent;

$judge = Agent::make()
    ->setInstructions('You are an expert evaluator for customer support responses.');

// Reference-free evaluation (criteria only)
$this->assert(new AgentJudge(
    judge: $judge,
    criteria: 'Response should be helpful, polite, and address the customer\'s question directly',
    threshold: 0.7
), $output);

// Reference-based evaluation (compare to expected)
$this->assert(new AgentJudge(
    judge: $judge,
    criteria: 'The response should convey the same meaning as the reference',
    threshold: 0.8,
    reference: $datasetItem['expected_answer']
), $output);

// With few-shot examples for calibration
$this->assert(new AgentJudge(
    judge: $judge,
    criteria: 'Rate the factual accuracy of the response',
    threshold: 0.7,
    examples: [
        [
            'input' => 'What is 2+2?',
            'output' => '2+2 equals 4',
            'score' => 1.0,
            'reasoning' => 'Mathematically correct and clear.',
        ],
    ]
), $output);

Pre-configured Judges

Built-in judges for common evaluation scenarios:

use NeuronAI\Evaluation\Assertions\Judges\{FaithfulnessJudge, CorrectnessJudge, RelevanceJudge, HelpfulnessJudge};

// Faithfulness - check if output is grounded in context (no hallucinations)
$this->assert(new FaithfulnessJudge(
    judge: $judge,
    context: $retrievedDocuments,
    threshold: 0.7
), $output);

// Correctness - compare to expected answer
$this->assert(new CorrectnessJudge(
    judge: $judge,
    expected: $datasetItem['expected_answer'],
    threshold: 0.7
), $output);

// Relevance - check if output addresses the question
$this->assert(new RelevanceJudge(
    judge: $judge,
    question: $datasetItem['question'],
    threshold: 0.7
), $output);

// Helpfulness - evaluate utility and actionability
$this->assert(new HelpfulnessJudge(
    judge: $judge,
    threshold: 0.7
), $output);

Creating Custom Assertions

use NeuronAI\Evaluation\Assertions\AbstractAssertion;
use NeuronAI\Evaluation\AssertionResult;

class GreaterThanAssertion extends AbstractAssertion
{
    public function __construct(
        private readonly float $threshold
    ) {}

    public function evaluate(mixed $actual): AssertionResult
    {
        if (!is_numeric($actual)) {
            return AssertionResult::fail(
                0.0,
                'Expected numeric value, got ' . gettype($actual),
            );
        }

        if ($actual > $this->threshold) {
            return AssertionResult::pass(1.0);
        }

        return AssertionResult::fail(
            0.0,
            "Expected {$actual} to be greater than {$this->threshold}",
        );
    }
}

Use it:

$this->assert(new GreaterThanAssertion(0.8), $score);

Running Evaluations

CLI Command

# Run all evaluators in a directory
vendor/bin/neuron evaluation /path/to/evaluators

# Verbose output (shows evaluator names)
vendor/bin/neuron evaluation --verbose /path/to/evaluators

# Using --path flag
vendor/bin/neuron evaluation --path=/path/to/evaluators

# Help
vendor/bin/neuron evaluation --help

Programmatic Execution

use NeuronAI\Evaluation\Runner\EvaluatorRunner;

$runner = new EvaluatorRunner();
$evaluator = new MyEvaluator();
$summary = $runner->run($evaluator);

echo "Passed: {$summary->getPassedCount()}\n";
echo "Failed: {$summary->getFailedCount()}\n";
// Expressions cannot be interpolated inside a string, so concatenate instead
echo 'Success Rate: ' . ($summary->getSuccessRate() * 100) . "%\n";

Output Configuration

Config File

Create evaluation.php in the project root:

<?php

use NeuronAI\Evaluation\Output\ConsoleOutput;
use NeuronAI\Evaluation\Output\JsonOutput;

return [
    'output' => [
        // Simple driver (no options)
        ConsoleOutput::class,

        // Driver with options (class as key)
        JsonOutput::class => [
            'path' => 'evaluation-results.json',
        ],
    ],
];

Default behavior: If no config file exists, ConsoleOutput is used.

Built-in Output Drivers

ConsoleOutput

ConsoleOutput::class => ['verbose' => true]

verbose - Show detailed input/output for failures

JsonOutput

// Write to file
JsonOutput::class => ['path' => 'results.json']

// Write to stdout
JsonOutput::class

Creating Custom Output Drivers

use NeuronAI\Evaluation\Contracts\EvaluationOutputInterface;
use NeuronAI\Evaluation\Runner\EvaluatorSummary;

class DatabaseOutput implements EvaluationOutputInterface
{
    public function __construct(
        private readonly \PDO $pdo,
        private readonly string $table = 'evaluations'
    ) {}

    public function output(EvaluatorSummary $summary): void
    {
        $stmt = $this->pdo->prepare(
            "INSERT INTO {$this->table}
            (passed, failed, success_rate, total_time, created_at)
            VALUES (?, ?, ?, ?, NOW())"
        );
        $stmt->execute([
            $summary->getPassedCount(),
            $summary->getFailedCount(),
            $summary->getSuccessRate(),
            $summary->getTotalExecutionTime(),
        ]);
    }
}

Register it under 'output' in evaluation.php:

DatabaseOutput::class => [
    'pdo' => new \PDO('mysql:host=localhost;dbname=evaluations', 'user', 'pass'),
    'table' => 'evaluations',
]

Project Setup

Configuring Autoloader

Add the evaluators directory to the autoload-dev section of composer.json, then run composer dump-autoload:

{
    "autoload-dev": {
        "psr-4": {
            "App\\Evaluators\\": "evaluators/"
        }
    }
}

Directory Structure

project/
├── evaluators/
│   ├── SentimentEvaluator.php
│   ├── SummarizationEvaluator.php
│   └── datasets/
│       ├── sentiment.json
│       └── summarization.json
├── evaluation.php
└── vendor/bin/neuron

Result Analysis

Accessing Results

$summary = $runner->run($evaluator);

// Basic stats
$summary->getPassedCount();      // int
$summary->getFailedCount();      // int
$summary->getTotalCount();       // int
$summary->getSuccessRate();     // float (0.0 - 1.0)

// Timing
$summary->getTotalExecutionTime();      // float (seconds)
$summary->getAverageExecutionTime();    // float (seconds)

// Assertions
$summary->getTotalAssertions();           // int
$summary->getTotalAssertionsPassed();     // int
$summary->getTotalAssertionsFailed();     // int
$summary->getAssertionSuccessRate();      // float (0.0 - 1.0)

// Detailed results
$summary->getResults();                 // array<EvaluatorResult>
$summary->getFailedResults();           // array<EvaluatorResult>

// Assertion failures grouped by location
$summary->getAssertionFailuresByLocation();  // array<string, AssertionFailure[]>

EvaluatorResult

foreach ($summary->getResults() as $result) {
    $result->getIndex();              // int
    $result->isPassed();             // bool
    $result->getInput();             // array
    $result->getOutput();            // mixed
    $result->getExecutionTime();      // float
    $result->getError();             // ?string
    $result->getAssertionsPassed();   // int
    $result->getAssertionsFailed();   // int
    $result->getAssertionFailures(); // array<AssertionFailure>
}

AssertionFailure

$failure->getEvaluatorClass();        // string
$failure->getShortEvaluatorClass(); // string
$failure->getAssertionMethod();     // string
$failure->getMessage();             // string
$failure->getLineNumber();          // int
$failure->getContext();             // array
$failure->getFullDescription();    // string

Common Patterns

Evaluating Multiple Metrics

public function evaluate(mixed $output, array $datasetItem): void
{
    $this->assert(new StringContains($datasetItem['topic']), $output);
    $this->assert(new StringLengthBetween(50, 500), $output);
    $this->assert(new IsValidJson(), $output);
}

Using AI Judge for Scoring

Use the built-in AgentJudge assertion for AI-powered evaluation:

use NeuronAI\Evaluation\Assertions\AgentJudge;
use NeuronAI\Evaluation\Assertions\Judges\CorrectnessJudge;

public function setUp(): void
{
    $this->judge = Agent::make()
        ->setInstructions('You are an expert evaluator for AI responses.');
}

public function evaluate(mixed $output, array $datasetItem): void
{
    // Simple criteria-based evaluation
    $this->assert(new AgentJudge(
        judge: $this->judge,
        criteria: 'Rate the quality and accuracy of the response',
        threshold: 0.7
    ), $output);

    // Or use pre-configured judges
    $this->assert(new CorrectnessJudge(
        judge: $this->judge,
        expected: $datasetItem['expected'],
        threshold: 0.7
    ), $output);
}

Testing RAG Systems

class RAGEvaluator extends BaseEvaluator
{
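    public function setUp(): void
    {
        $this->rag = new MyRAGAgent();
        // Initialize the embeddings provider used by StringSimilarity in evaluate()
        $this->embeddings = new OpenAIEmbeddings(key: 'YOUR_KEY');
    }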

    public function run(array $datasetItem): mixed
    {
        return $this->rag->chat(
            new UserMessage($datasetItem['question'])
        )->getMessage()->getContent();
    }

    public function evaluate(mixed $output, array $datasetItem): void
    {
        $this->assert(new StringContainsAny($datasetItem['key_facts']), $output);
        $this->assert(new StringSimilarity(
            reference: $datasetItem['expected_answer'],
            embeddingsProvider: $this->embeddings,
            threshold: 0.7
        ), $output);
    }
}

Comparing Multiple Agents

public function setUp(): void
{
    $this->agentA = new AgentOne();
    $this->agentB = new AgentTwo();
}

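public function run(array $datasetItem): mixed
{
    // chat(...) is a placeholder: pass the message for this item, e.g. a
    // UserMessage as in the evaluators above. Left literally, PHP 8.1+ parses
    // chat(...) as first-class callable syntax, so the call must be filled in.
    return [
        'agent_a' => $this->agentA->chat(...)->getContent(),
        'agent_b' => $this->agentB->chat(...)->getContent(),
    ];
}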

public function evaluate(mixed $output, array $datasetItem): void
{
    $similarity = $this->calculateSimilarity(
        $output['agent_a'],
        $output['agent_b']
    );
    $this->assert(new GreaterThanAssertion(0.8), $similarity);
}
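The calculateSimilarity() helper used above is not defined in this document. One simple way to fill it in is a lexical comparison with PHP's similar_text(); the version below is an assumption for illustration and measures character overlap, not meaning. For semantic comparison, an embeddings-based measure would be the closer fit.

```php
<?php
// Hypothetical helper: lexical similarity in the range 0.0 to 1.0.
// In an evaluator this would typically be a private method.
function calculateSimilarity(string $a, string $b): float
{
    if ($a === '' && $b === '') {
        return 1.0; // similar_text() reports 0% for two empty strings
    }

    // similar_text() writes the similarity percentage into its third argument.
    similar_text($a, $b, $percent);

    return $percent / 100.0;
}

// Stand-ins for the two agents' answers to the same question.
$similarity = calculateSimilarity('World', 'Word');
var_dump($similarity > 0.8);
```

The resulting score can be passed straight to GreaterThanAssertion as in the evaluate() method above.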

Best Practices

Evaluator Design

Keep evaluators focused - One evaluator per use case
Use descriptive dataset items - Include expected values, metadata
Leverage setUp() - Initialize expensive resources once
Test in isolation - Keep run() and evaluate() free of shared mutable state so one dataset item cannot affect another

Assertion Usage

Use specific assertions - Prefer StringContains over generic checks
Set appropriate thresholds - Balance sensitivity vs. false positives
Combine multiple assertions - Check different aspects of output
Use embeddings for semantic similarity - Don't rely only on string matching

Dataset Management

Separate test data - Keep datasets in a dedicated directory (e.g. evaluators/datasets/)
Use JSON for large datasets - Easier to maintain than arrays
Include diverse cases - Edge cases, typical cases, boundary values
Version control datasets - Track changes to test cases

Output Configuration

Configure multiple drivers - Console for quick checks, JSON for CI/CD
Use verbose mode - Detailed failure info during development
Write custom drivers - Integrate with existing systems (databases, APIs)

CLI Generation

The Neuron CLI does not have a make:evaluator command yet. Create evaluator classes manually in the evaluators directory.

Testing Evaluators

use PHPUnit\Framework\TestCase;
use NeuronAI\Evaluation\Runner\EvaluatorRunner;

class MyEvaluatorTest extends TestCase
{
    public function testEvaluatorRuns(): void
    {
        $runner = new EvaluatorRunner();
        $evaluator = new MyEvaluator();
        $summary = $runner->run($evaluator);

        $this->assertGreaterThan(0, $summary->getTotalCount());
    }

    public function testEvaluatorHasNoFailures(): void
    {
        $runner = new EvaluatorRunner();
        $evaluator = new MyEvaluator();
        $summary = $runner->run($evaluator);

        $this->assertEquals(0, $summary->getFailedCount());
    }
}

Integration with CI/CD

GitHub Actions

name: Evaluation Tests

on: [push, pull_request]

jobs:
    evaluate:
        runs-on: ubuntu-latest
        steps:
            - uses: actions/checkout@v3
            - name: Setup PHP
              uses: shivammathur/setup-php@v2
              with:
                  php-version: '8.2'
            - name: Install dependencies
              run: composer install
            - name: Run evaluations
              run: vendor/bin/neuron evaluation evaluators --verbose
              env:
                  ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}

Failing on Thresholds

# Run and exit with 1 if any failures
vendor/bin/neuron evaluation evaluators || exit 1
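The command above fails the build on any failure. To fail only below a chosen success rate, a small PHP script can turn the rate into a process exit code. This is a sketch under assumptions: the function names and the 0.9 minimum are hypothetical, and the counts would come from getPassedCount() and getFailedCount() on the summary returned by EvaluatorRunner.

```php
<?php
// Success rate as a fraction between 0.0 and 1.0; an empty run counts as 0.0.
function successRate(int $passed, int $failed): float
{
    $total = $passed + $failed;

    return $total === 0 ? 0.0 : $passed / $total;
}

// 0 signals success to the CI system, 1 signals failure.
function exitCodeForSuccessRate(float $successRate, float $minimumRate): int
{
    return $successRate >= $minimumRate ? 0 : 1;
}

// Example with fixed counts: 18 of 20 items passed.
$code = exitCodeForSuccessRate(successRate(18, 2), 0.9);
echo "Exit code: {$code}\n";

// In a real script, finish with: exit($code);
```

Treating an empty run as a failure is a deliberate choice here, so a misconfigured path that finds no evaluators does not pass silently.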

Key Decision Points

When helping users with evaluations:

Dataset format depends on:
- Small datasets → ArrayDataset (in code)
- Large/external datasets → JsonDataset (files)

Assertion choice depends on:
- Substring or prefix matching → StringContains, StringStartsWith
- Pattern matching → MatchesRegex
- Semantic similarity → StringSimilarity (embeddings)
- Fuzzy matching → StringDistance

Output configuration based on:
- Development → ConsoleOutput with verbose mode
- CI/CD → JsonOutput to file
- Analytics → Custom driver to database/API

Evaluation granularity:
- Unit tests → Single assertion per evaluator
- Integration tests → Multiple assertions
- System tests → Multiple evaluators covering different scenarios

