resources/boost/skills/neuron-evaluation-engineer/SKILL.md
Create and run AI evaluations with datasets, assertions, and output drivers in Neuron AI. Use this skill whenever the user mentions evaluation, testing AI systems, creating evaluators, dataset-driven testing, assertion-based validation, or wants to measure AI system performance. Also trigger for tasks involving evaluator discovery, output configuration, result analysis, or building custom assertions.
npx skillsauth add neuron-core/neuron-laravel neuron-evaluation-engineer
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
This skill helps you create and run evaluations for AI systems in Neuron AI. The evaluation system provides dataset-driven testing with flexible assertions, comprehensive result reporting, and extensible output drivers.
Evaluations test AI systems using three main components:
Dataset Items → Evaluator::run() → Output → Evaluator::evaluate() → Assertions → Results
For each dataset item:
setUp() - Initialize resources (once per evaluator)
run(datasetItem) - Execute your AI logic
evaluate(output, datasetItem) - Assert against expected results
Note: Each evaluation starts with a fresh assertion executor, so no manual reset is needed.
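The call order described above can be sketched with a small stand-in class. This is only an illustration of the documented sequence (setUp once, then run and evaluate for every dataset item); `LifecycleSketch` and `runLifecycle` are hypothetical names and are not part of Neuron AI.

```php
<?php
// Hypothetical stand-in, NOT the Neuron AI runner: it only records the
// documented order of calls so the lifecycle is easy to see.
final class LifecycleSketch
{
    /** @var list<string> */
    public array $calls = [];

    public function setUp(): void
    {
        $this->calls[] = 'setUp';
    }

    public function run(array $datasetItem): mixed
    {
        $this->calls[] = 'run';
        // Placeholder for the real AI call.
        return strtoupper($datasetItem['text']);
    }

    public function evaluate(mixed $output, array $datasetItem): void
    {
        $this->calls[] = 'evaluate';
    }
}

// Assumed driver loop: setUp once, then run + evaluate per dataset item.
function runLifecycle(LifecycleSketch $evaluator, array $dataset): array
{
    $evaluator->setUp();
    foreach ($dataset as $item) {
        $output = $evaluator->run($item);
        $evaluator->evaluate($output, $item);
    }
    return $evaluator->calls;
}

$calls = runLifecycle(new LifecycleSketch(), [['text' => 'a'], ['text' => 'b']]);
```

With two dataset items, `$calls` holds one `setUp` followed by two `run`/`evaluate` pairs.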
use NeuronAI\Evaluation\BaseEvaluator;
use NeuronAI\Evaluation\Contracts\DatasetInterface;
use NeuronAI\Evaluation\Assertions\StringContains;
use NeuronAI\Evaluation\Dataset\ArrayDataset;
use NeuronAI\Chat\Messages\UserMessage;

// MyAgent is your own agent class (not shown here).
class ContainsEvaluator extends BaseEvaluator
{
    public function getDataset(): DatasetInterface
    {
        return new ArrayDataset([
            [
                'text' => 'I love this product!',
                'content' => 'product',
            ],
            [
                'text' => 'This is terrible.',
                'content' => 'positive',
            ],
        ]);
    }

    // Execute the AI logic for one dataset item and return its raw output.
    public function run(array $datasetItem): mixed
    {
        $response = MyAgent::make()->chat(
            new UserMessage($datasetItem['text'])
        )->getMessage();

        return $response->getContent();
    }

    // Assert that the output contains the substring expected for this item.
    public function evaluate(mixed $output, array $datasetItem): void
    {
        $this->assert(
            new StringContains($datasetItem['content']),
            $output
        );
    }
}
For larger datasets, use JSON files:
use NeuronAI\Evaluation\Dataset\JsonDataset;
public function getDataset(): DatasetInterface
{
return new JsonDataset(__DIR__ . '/datasets/sentiment.json');
}
JSON format (sentiment.json):
[
{"text": "I love this!", "expected": "positive"},
{"text": "This is bad.", "expected": "negative"}
]
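Since run() and evaluate() receive each item as an array, every JSON object presumably arrives as an associative array keyed by its field names. How JsonDataset decodes the file internally is not shown here, so the following is only a sketch using plain json_decode on the same data.

```php
<?php
// Same content as sentiment.json above, inlined so the sketch is self-contained.
$json = <<<'JSON'
[
    {"text": "I love this!", "expected": "positive"},
    {"text": "This is bad.", "expected": "negative"}
]
JSON;

// Decode to associative arrays; throws JsonException on malformed input.
$items = json_decode($json, true, 512, JSON_THROW_ON_ERROR);

foreach ($items as $index => $item) {
    // Each $item has the shape that run(array $datasetItem) would receive.
    echo $index . ': ' . $item['text'] . ' => ' . $item['expected'] . PHP_EOL;
}
```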
Check if the output contains a substring:
$this->assert(new StringContains('positive'), $output);
Check if the output contains all keywords:
$this->assert(new StringContainsAll(['hello', 'world']), $output);
Check if the output contains any of the keywords:
$this->assert(new StringContainsAny(['success', 'completed']), $output);
Check if the output starts with a prefix:
$this->assert(new StringStartsWith('Hello'), $output);
Check if the output ends with a suffix:
$this->assert(new StringEndsWith('!'), $output);
Check if the string length is within range:
$this->assert(new StringLengthBetween(10, 100), $output);
Check string similarity using Levenshtein distance:
$this->assert(new StringDistance(
reference: 'expected text',
threshold: 0.5, // Minimum similarity score
maxDistance: 50 // Maximum allowed edits
), $output);
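To get a feel for how an edit distance turns into a similarity score, here is a sketch built on PHP's native levenshtein(). The normalization (one minus distance divided by the longer length) is an assumption for illustration; the exact formula StringDistance applies is not stated here, and `levenshteinSimilarity` is a hypothetical helper.

```php
<?php
// Hypothetical helper: normalizes Levenshtein distance into a 0.0 - 1.0 score.
// Assumption: similarity = 1 - distance / length of the longer string.
function levenshteinSimilarity(string $reference, string $actual): float
{
    $maxLength = max(strlen($reference), strlen($actual));
    if ($maxLength === 0) {
        return 1.0; // two empty strings are identical
    }

    return 1.0 - levenshtein($reference, $actual) / $maxLength;
}

// "kitten" -> "sitting" needs 3 edits over a maximum length of 7.
$distance = levenshtein('kitten', 'sitting');
$score = levenshteinSimilarity('kitten', 'sitting');
```

Under this assumed normalization the score is 4/7 (about 0.57), which would pass a threshold of 0.5.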
Check string similarity using embeddings:
use NeuronAI\Evaluation\Assertions\StringSimilarity;
use NeuronAI\RAG\Embeddings\OpenAI\OpenAIEmbeddings;
$this->assert(new StringSimilarity(
reference: 'The quick brown fox',
embeddingsProvider: new OpenAIEmbeddings(key: 'YOUR_KEY'),
threshold: 0.6
), $output);
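Embedding-based similarity is commonly computed as the cosine of the angle between two vectors. Whether StringSimilarity uses exactly this metric is an assumption; the sketch below only shows how a score compared against a threshold such as 0.6 can be derived from two embedding vectors. `cosineSimilarity` is a hypothetical helper.

```php
<?php
// Hypothetical helper: cosine similarity between two equal-length vectors.
function cosineSimilarity(array $a, array $b): float
{
    if ($a === [] || count($a) !== count($b)) {
        throw new InvalidArgumentException('Vectors must be non-empty and of equal length.');
    }

    $a = array_values($a);
    $b = array_values($b);

    $dot = 0.0;
    $normA = 0.0;
    $normB = 0.0;
    foreach ($a as $i => $value) {
        $dot += $value * $b[$i];
        $normA += $value ** 2;
        $normB += $b[$i] ** 2;
    }

    if ($normA == 0.0 || $normB == 0.0) {
        return 0.0; // a zero vector has no direction
    }

    return $dot / (sqrt($normA) * sqrt($normB));
}

$same = cosineSimilarity([1.0, 0.0], [1.0, 0.0]);       // identical direction
$orthogonal = cosineSimilarity([1.0, 0.0], [0.0, 1.0]); // unrelated direction
```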
Match against regular expression:
$this->assert(new MatchesRegex('/^\d{3}-\d{2}-\d{4}$/'), $output);
Check if the output is valid JSON:
$this->assert(new IsValidJson(), $output);
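For reference, a JSON validity check can be written in plain PHP as below. This is a sketch of the general technique, not the IsValidJson implementation; `isValidJson` is a hypothetical function name.

```php
<?php
// Hypothetical helper: returns true when the string parses as JSON.
function isValidJson(string $value): bool
{
    try {
        json_decode($value, false, 512, JSON_THROW_ON_ERROR);
        return true;
    } catch (JsonException) {
        return false;
    }
}

$ok = isValidJson('{"status": "done"}');
$broken = isValidJson('{status: done}'); // unquoted keys are not valid JSON
```

This matters for LLM output in particular, where a model may wrap JSON in prose or emit unquoted keys.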
Use an AI agent to evaluate outputs with custom criteria:
use NeuronAI\Evaluation\Assertions\AgentJudge;
use NeuronAI\Agent;
$judge = Agent::make()
->setInstructions('You are an expert evaluator for customer support responses.');
// Reference-free evaluation (criteria only)
$this->assert(new AgentJudge(
judge: $judge,
criteria: 'Response should be helpful, polite, and address the customer\'s question directly',
threshold: 0.7
), $output);
// Reference-based evaluation (compare to expected)
$this->assert(new AgentJudge(
judge: $judge,
criteria: 'The response should convey the same meaning as the reference',
threshold: 0.8,
reference: $datasetItem['expected_answer']
), $output);
// With few-shot examples for calibration
$this->assert(new AgentJudge(
judge: $judge,
criteria: 'Rate the factual accuracy of the response',
threshold: 0.7,
examples: [
[
'input' => 'What is 2+2?',
'output' => '2+2 equals 4',
'score' => 1.0,
'reasoning' => 'Mathematically correct and clear.',
],
]
), $output);
Built-in judges for common evaluation scenarios:
use NeuronAI\Evaluation\Assertions\Judges\{FaithfulnessJudge, CorrectnessJudge, RelevanceJudge, HelpfulnessJudge};
// Faithfulness - check if output is grounded in context (no hallucinations)
$this->assert(new FaithfulnessJudge(
judge: $judge,
context: $retrievedDocuments,
threshold: 0.7
), $output);
// Correctness - compare to expected answer
$this->assert(new CorrectnessJudge(
judge: $judge,
expected: $datasetItem['expected_answer'],
threshold: 0.7
), $output);
// Relevance - check if output addresses the question
$this->assert(new RelevanceJudge(
judge: $judge,
question: $datasetItem['question'],
threshold: 0.7
), $output);
// Helpfulness - evaluate utility and actionability
$this->assert(new HelpfulnessJudge(
judge: $judge,
threshold: 0.7
), $output);
use NeuronAI\Evaluation\Assertions\AbstractAssertion;
use NeuronAI\Evaluation\AssertionResult;

class GreaterThanAssertion extends AbstractAssertion
{
    public function __construct(
        private readonly float $threshold
    ) {}

    public function evaluate(mixed $actual): AssertionResult
    {
        if (!is_numeric($actual)) {
            return AssertionResult::fail(
                0.0,
                'Expected numeric value, got ' . gettype($actual),
            );
        }

        if ($actual > $this->threshold) {
            return AssertionResult::pass(1.0);
        }

        return AssertionResult::fail(
            0.0,
            "Expected {$actual} to be greater than {$this->threshold}",
        );
    }
}
Use it:
$this->assert(new GreaterThanAssertion(0.8), $score);
# Run all evaluators in a directory
vendor/bin/neuron evaluation /path/to/evaluators
# Verbose output (shows evaluator names)
vendor/bin/neuron evaluation --verbose /path/to/evaluators
# Using --path flag
vendor/bin/neuron evaluation --path=/path/to/evaluators
# Help
vendor/bin/neuron evaluation --help
use NeuronAI\Evaluation\Runner\EvaluatorRunner;
$runner = new EvaluatorRunner();
$evaluator = new MyEvaluator();
$summary = $runner->run($evaluator);
echo "Passed: {$summary->getPassedCount()}\n";
echo "Failed: {$summary->getFailedCount()}\n";
echo 'Success Rate: ' . ($summary->getSuccessRate() * 100) . "%\n";
Create evaluation.php in project root:
<?php
use NeuronAI\Evaluation\Output\ConsoleOutput;
use NeuronAI\Evaluation\Output\JsonOutput;
return [
'output' => [
// Simple driver (no options)
ConsoleOutput::class,
// Driver with options (class as key)
JsonOutput::class => [
'path' => 'evaluation-results.json',
],
],
];
Default behavior: If no config exists, uses ConsoleOutput.
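The 'output' array mixes two shapes: a bare class name (integer key) and a class name used as key with an options array. A loader has to tell them apart. The sketch below shows one way to normalize both shapes into a class => options map. It is not Neuron AI's own loader; `normalizeOutputConfig` is a hypothetical helper and plain strings stand in for the `::class` constants.

```php
<?php
// Hypothetical helper: turns the mixed 'output' list into class => options.
function normalizeOutputConfig(array $output): array
{
    $drivers = [];
    foreach ($output as $key => $value) {
        if (is_int($key)) {
            // Simple driver: the value is the class name, no options.
            $drivers[$value] = [];
        } else {
            // Driver with options: the key is the class name.
            $drivers[$key] = $value;
        }
    }

    return $drivers;
}

// Strings used in place of ConsoleOutput::class / JsonOutput::class.
$config = [
    'output' => [
        'ConsoleOutput',
        'JsonOutput' => ['path' => 'evaluation-results.json'],
    ],
];

$drivers = normalizeOutputConfig($config['output']);
```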
ConsoleOutput::class => ['verbose' => true]
verbose - Show detailed input/output for failures
// Write to file
JsonOutput::class => ['path' => 'results.json']
// Write to stdout
JsonOutput::class
use NeuronAI\Evaluation\Contracts\EvaluationOutputInterface;
use NeuronAI\Evaluation\Runner\EvaluatorSummary;

class DatabaseOutput implements EvaluationOutputInterface
{
    public function __construct(
        private readonly \PDO $pdo,
        private readonly string $table = 'evaluations'
    ) {}

    public function output(EvaluatorSummary $summary): void
    {
        // The table name is interpolated, so it must come from trusted config.
        $stmt = $this->pdo->prepare(
            "INSERT INTO {$this->table}
            (passed, failed, success_rate, total_time, created_at)
            VALUES (?, ?, ?, ?, NOW())"
        );

        $stmt->execute([
            $summary->getPassedCount(),
            $summary->getFailedCount(),
            $summary->getSuccessRate(),
            $summary->getTotalExecutionTime(),
        ]);
    }
}
Register in config:
DatabaseOutput::class => [
'pdo' => new \PDO('mysql:host=localhost;dbname=evaluations', 'user', 'pass'),
'table' => 'evaluations',
]
Add evaluators directory to composer.json:
{
"autoload-dev": {
"psr-4": {
"App\\Evaluators\\": "evaluators/"
}
}
}
project/
├── evaluators/
│   ├── SentimentEvaluator.php
│   ├── SummarizationEvaluator.php
│   └── datasets/
│       ├── sentiment.json
│       └── summarization.json
├── evaluation.php
└── vendor/bin/neuron
$summary = $runner->run($evaluator);
// Basic stats
$summary->getPassedCount(); // int
$summary->getFailedCount(); // int
$summary->getTotalCount(); // int
$summary->getSuccessRate(); // float (0.0 - 1.0)
// Timing
$summary->getTotalExecutionTime(); // float (seconds)
$summary->getAverageExecutionTime(); // float (seconds)
// Assertions
$summary->getTotalAssertions(); // int
$summary->getTotalAssertionsPassed(); // int
$summary->getTotalAssertionsFailed(); // int
$summary->getAssertionSuccessRate(); // float (0.0 - 1.0)
// Detailed results
$summary->getResults(); // array<EvaluatorResult>
$summary->getFailedResults(); // array<EvaluatorResult>
// Assertion failures grouped by location
$summary->getAssertionFailuresByLocation(); // array<string, AssertionFailure[]>
foreach ($summary->getResults() as $result) {
$result->getIndex(); // int
$result->isPassed(); // bool
$result->getInput(); // array
$result->getOutput(); // mixed
$result->getExecutionTime(); // float
$result->getError(); // ?string
$result->getAssertionsPassed(); // int
$result->getAssertionsFailed(); // int
$result->getAssertionFailures(); // array<AssertionFailure>
}
$failure->getEvaluatorClass(); // string
$failure->getShortEvaluatorClass(); // string
$failure->getAssertionMethod(); // string
$failure->getMessage(); // string
$failure->getLineNumber(); // int
$failure->getContext(); // array
$failure->getFullDescription(); // string
public function evaluate(mixed $output, array $datasetItem): void
{
$this->assert(new StringContains($datasetItem['topic']), $output);
$this->assert(new StringLengthBetween(50, 500), $output);
$this->assert(new IsValidJson(), $output);
}
Use the built-in AgentJudge assertion for AI-powered evaluation:
use NeuronAI\Agent;
use NeuronAI\Evaluation\Assertions\AgentJudge;
use NeuronAI\Evaluation\Assertions\Judges\CorrectnessJudge;
public function setUp(): void
{
$this->judge = Agent::make()
->setInstructions('You are an expert evaluator for AI responses.');
}
public function evaluate(mixed $output, array $datasetItem): void
{
// Simple criteria-based evaluation
$this->assert(new AgentJudge(
judge: $this->judge,
criteria: 'Rate the quality and accuracy of the response',
threshold: 0.7
), $output);
// Or use pre-configured judges
$this->assert(new CorrectnessJudge(
judge: $this->judge,
expected: $datasetItem['expected'],
threshold: 0.7
), $output);
}
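class RAGEvaluator extends BaseEvaluator
{
    private MyRAGAgent $rag;
    private OpenAIEmbeddings $embeddings;

    public function setUp(): void
    {
        $this->rag = new MyRAGAgent();
        // Embeddings provider used by StringSimilarity in evaluate().
        $this->embeddings = new OpenAIEmbeddings(key: 'YOUR_KEY');
    }

    public function run(array $datasetItem): mixed
    {
        return $this->rag->chat(
            new UserMessage($datasetItem['question'])
        )->getMessage()->getContent();
    }

    public function evaluate(mixed $output, array $datasetItem): void
    {
        // At least one key fact must appear in the answer.
        $this->assert(new StringContainsAny($datasetItem['key_facts']), $output);

        // The answer must be semantically close to the expected answer.
        $this->assert(new StringSimilarity(
            reference: $datasetItem['expected_answer'],
            embeddingsProvider: $this->embeddings,
            threshold: 0.7
        ), $output);
    }
}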
public function setUp(): void
{
$this->agentA = new AgentOne();
$this->agentB = new AgentTwo();
}
public function run(array $datasetItem): mixed
{
return [
'agent_a' => $this->agentA->chat(...)->getContent(),
'agent_b' => $this->agentB->chat(...)->getContent(),
];
}
public function evaluate(mixed $output, array $datasetItem): void
{
$similarity = $this->calculateSimilarity(
$output['agent_a'],
$output['agent_b']
);
$this->assert(new GreaterThanAssertion(0.8), $similarity);
}
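The comparison example above calls calculateSimilarity() without defining it. One possible stand-in, assuming a purely lexical comparison is acceptable, uses PHP's native similar_text(). The function below is hypothetical (in the evaluator it would be a private method) and an embedding-based measure may be a better fit for free-form agent output.

```php
<?php
// Hypothetical helper: lexical similarity of two strings as a 0.0 - 1.0 score,
// based on the percentage reported by PHP's similar_text().
function calculateSimilarity(string $a, string $b): float
{
    if ($a === '' && $b === '') {
        return 1.0; // treat two empty outputs as identical
    }

    similar_text($a, $b, $percent);

    return $percent / 100;
}

$identical = calculateSimilarity('The order has shipped.', 'The order has shipped.');
$disjoint = calculateSimilarity('abc', 'xyz');
```

The result can be passed straight to GreaterThanAssertion, as in the evaluate() method above.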
setUp() - Initialize expensive resources once
Keep run() and evaluate() pure functions
Prefer specific assertions such as StringContains over generic checks
# (Note: Neuron CLI doesn't have make:evaluator yet)
# Create evaluator manually in evaluators directory
use PHPUnit\Framework\TestCase;
use NeuronAI\Evaluation\Runner\EvaluatorRunner;
class MyEvaluatorTest extends TestCase
{
public function testEvaluatorRuns(): void
{
$runner = new EvaluatorRunner();
$evaluator = new MyEvaluator();
$summary = $runner->run($evaluator);
$this->assertGreaterThan(0, $summary->getTotalCount());
}
public function testEvaluatorHasNoFailures(): void
{
$runner = new EvaluatorRunner();
$evaluator = new MyEvaluator();
$summary = $runner->run($evaluator);
$this->assertEquals(0, $summary->getFailedCount());
}
}
name: Evaluation Tests
on: [push, pull_request]
jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Setup PHP
        uses: shivammathur/setup-php@v2
        with:
          php-version: '8.2'
      - name: Install dependencies
        run: composer install
      - name: Run evaluations
        run: vendor/bin/neuron evaluation evaluators --verbose
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
# Run and exit with 1 if any failures
vendor/bin/neuron evaluation evaluators || exit 1
When helping users with evaluations:
Dataset format depends on:
  ArrayDataset (in code)
  JsonDataset (files)
Assertion choice depends on:
  StringContains, StringStartsWith
  MatchesRegex
  StringSimilarity (embeddings)
  StringDistance
Output configuration based on:
  ConsoleOutput with verbose mode
  JsonOutput to file
Evaluation granularity: