tylerjrbuell/quality-assurance

Name: quality-assurance
Author: tylerjrbuell

apps/docs/skills/quality-assurance/SKILL.md

npx skillsauth add tylerjrbuell/reactive-agents-ts quality-assurance

Quality Assurance

Agent objective

Produce a builder with verification enabled and the right detectors active, plus an understanding of how to run LLM-scored evals against agent output using the @reactive-agents/eval package.

When to load this skill

Agent output must be factually accurate or grounded in retrieved content
Detecting hallucinated or fabricated responses before returning them to users
Running batch evaluation of agent quality across test cases
Adding a post-reasoning reflection or self-check step to the pipeline

Implementation baseline

import { ReactiveAgents } from "@reactive-agents/runtime";

const agent = await ReactiveAgents.create()
  .withProvider("anthropic")
  .withReasoning({ defaultStrategy: "plan-execute-reflect", maxIterations: 15 })
  .withTools({ allowedTools: ["web-search", "http-get", "checkpoint"] })
  .withVerification({
    semanticEntropy: true,      // estimate output uncertainty via entropy
    selfConsistency: true,      // check consistency across response variations
    hallucinationDetection: true,
    hallucinationThreshold: 0.15,  // flag if hallucination score > 0.15
    passThreshold: 0.75,           // reject outputs scoring below 0.75
  })
  .withVerificationStep({ mode: "reflect" })  // add a reflection phase at the end
  .build();
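The `semanticEntropy` and `selfConsistency` options are only named here, so a rough illustration of the general idea may help. The sketch below is not the library's implementation: it samples several answers, groups equivalent ones, and computes Shannon entropy over the groups, plus the share of samples that agree with the most common answer. The function names are hypothetical, and treating answers as equivalent when they match after trimming and lowercasing is an assumption (a real check would compare meaning, not strings).

```typescript
// Hypothetical sketch, not the @reactive-agents implementation.
// Groups sampled answers and returns the size of each group.
// Assumption: answers are "equivalent" if equal after trim + lowercase.
function countGroups(samples: string[]): number[] {
  const counts: Record<string, number> = Object.create(null);
  for (const s of samples) {
    const key = s.trim().toLowerCase();
    counts[key] = (counts[key] ?? 0) + 1;
  }
  return Object.keys(counts).map((k) => counts[k]);
}

// Shannon entropy (in bits) over answer groups: 0 means every sample agreed.
function semanticEntropy(samples: string[]): number {
  if (samples.length === 0) return 0;
  let entropy = 0;
  for (const count of countGroups(samples)) {
    const p = count / samples.length;
    entropy -= p * Math.log2(p);
  }
  return entropy;
}

// Fraction of samples that agree with the most common answer.
function selfConsistency(samples: string[]): number {
  if (samples.length === 0) return 0;
  return Math.max(...countGroups(samples)) / samples.length;
}
```

Low entropy and high consistency suggest the model answers the same way across samples; neither guarantees that the answer is correct.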

Key patterns

withVerification() — runtime output checking

.withVerification()
// Enables defaults: semanticEntropy=true, factDecomposition=true,
// selfConsistency=true, nli=true, passThreshold=0.7, riskThreshold=0.5

.withVerification({
  semanticEntropy: true,        // estimate output uncertainty via entropy
  factDecomposition: true,      // decompose and verify individual claims
  multiSource: true,            // cross-reference against multiple sources (default: false)
  selfConsistency: true,        // run variations and check consistency
  nli: true,                    // natural language inference entailment check
  hallucinationDetection: false, // dedicated hallucination layer (default: false)
  hallucinationThreshold: 0.10, // score above which output is flagged (0-1)
  passThreshold: 0.70,          // overall pass threshold (0-1)
  riskThreshold: 0.50,          // outputs below this risk score are flagged
})
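To make the threshold comments above concrete, here is a minimal sketch of how `passThreshold` and `hallucinationThreshold` could gate an output. It follows only the comments in this document (reject below the pass threshold, flag above the hallucination threshold). The `VerificationScores` shape, the `Verdict` type, and the `decide` function are hypothetical names for illustration, not the library's API. `riskThreshold` is left out because its exact semantics are not spelled out here.

```typescript
// Hypothetical sketch of threshold gating; names and shapes are assumptions.
interface VerificationScores {
  overall: number;        // combined verification score, 0-1 (higher is better)
  hallucination: number;  // hallucination score, 0-1 (higher is worse)
}

interface Thresholds {
  passThreshold: number;
  hallucinationThreshold: number;
}

type Verdict = "pass" | "flagged" | "rejected";

function decide(scores: VerificationScores, t: Thresholds): Verdict {
  // Reject outputs whose overall score falls below the pass threshold.
  if (scores.overall < t.passThreshold) return "rejected";
  // Flag outputs whose hallucination score exceeds its threshold.
  if (scores.hallucination > t.hallucinationThreshold) return "flagged";
  return "pass";
}

// Usage with the documented default thresholds.
const defaults: Thresholds = { passThreshold: 0.7, hallucinationThreshold: 0.1 };
const verdict = decide({ overall: 0.82, hallucination: 0.04 }, defaults);
```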

withVerificationStep() — post-reasoning verification pass

// Adds a dedicated verification phase after the main reasoning loop:

.withVerificationStep({ mode: "reflect" })
// Agent reflects on its own output for accuracy and completeness.
// Uses the same provider/model as the main agent.

.withVerificationStep({ mode: "loop" })
// Runs multiple verification passes until the output passes or max retries reached.

.withVerificationStep({
  mode: "reflect",
  prompt: "Check your answer for factual accuracy. Cite sources where possible.",
})
// Custom verification prompt.
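The `"loop"` mode is described above as repeating verification until the output passes or the retry limit is reached. The sketch below shows that control flow in isolation, under stated assumptions: `verifyLoop`, `VerifyResult`, and the `verify` and `revise` callbacks are hypothetical, and the callbacks are synchronous only to keep the example short (real verification calls to an LLM would be asynchronous).

```typescript
// Hypothetical sketch of a verify-and-revise loop; not the library's code.
interface VerifyResult {
  passed: boolean;
  feedback: string; // what the verifier wants fixed
}

interface LoopOutcome {
  output: string;
  passed: boolean;
  attempts: number; // number of verification passes performed
}

function verifyLoop(
  draft: string,
  verify: (output: string) => VerifyResult,
  revise: (output: string, feedback: string) => string,
  maxRetries = 3,
): LoopOutcome {
  let output = draft;
  for (let attempt = 1; attempt <= maxRetries; attempt++) {
    const result = verify(output);
    if (result.passed) return { output, passed: true, attempts: attempt };
    // Only revise if another verification pass will follow.
    if (attempt < maxRetries) output = revise(output, result.feedback);
  }
  return { output, passed: false, attempts: maxRetries };
}

// Usage: a toy verifier that requires a citation marker in the answer.
const outcome = verifyLoop(
  "Paris is the capital of France.",
  (o) => ({ passed: o.includes("[source]"), feedback: "add a citation" }),
  (o) => `${o} [source]`,
);
```

Returning the last output with `passed: false` leaves the caller to decide whether to surface or discard an answer that never passed.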

Eval scoring with @reactive-agents/eval

Run LLM-scored evaluations against a dataset of test cases:

import { EvalService, makeEvalServiceLive } from "@reactive-agents/eval";
import { Effect } from "effect";

// NOTE: `anthropicLLM` and `makeAgentRunner` are not defined in this snippet.
// Supply both from your own setup before running it.

const evalSuite = {
  name: "agent-quality",
  cases: [
    {
      id: "test-1",
      input: "What is the capital of France?",
      expectedOutput: "Paris",
      context: "Geography question",
    },
  ],
};

const program = Effect.gen(function* () {
  const evalSvc = yield* EvalService;
  const run = yield* evalSvc.runSuite(
    evalSuite,
    "my-agent-config",
    makeAgentRunner(anthropicLLM)
  );
  // passRate is a fraction (0-1); format it as a percentage.
  console.log(`Pass rate: ${(run.summary.passRate * 100).toFixed(1)}%`);
  console.log(`Avg score: ${run.summary.averageScore}`);
});

await Effect.runPromise(
  Effect.provide(program, makeEvalServiceLive(anthropicLLM))
);
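The example above reads `passRate` and `averageScore` from the run summary. As a sketch of what those two numbers mean, assuming each test case yields a single score between 0 and 1, they could be derived as follows. `CaseResult`, `summarize`, and the 0.7 pass cutoff are hypothetical choices for illustration, not the package's actual types or defaults.

```typescript
// Hypothetical sketch: deriving a run summary from per-case scores.
interface CaseResult {
  id: string;
  score: number; // judged score for this case, 0-1
}

interface Summary {
  passRate: number;     // fraction of cases at or above the cutoff, 0-1
  averageScore: number; // mean score across all cases
}

function summarize(results: CaseResult[], passCutoff = 0.7): Summary {
  if (results.length === 0) return { passRate: 0, averageScore: 0 };
  const passed = results.filter((r) => r.score >= passCutoff).length;
  const total = results.reduce((sum, r) => sum + r.score, 0);
  return {
    passRate: passed / results.length,
    averageScore: total / results.length,
  };
}

// Usage: one passing case and one failing case.
const summary = summarize([
  { id: "test-1", score: 1.0 },
  { id: "test-2", score: 0.5 },
]);
```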

5 eval scoring dimensions

| Dimension | Scorer | What it measures |
|-----------|--------|-----------------|
| Accuracy | scoreAccuracy | Factual correctness vs expected output |
| Relevance | scoreRelevance | How well the response addresses the input |
| Completeness | scoreCompleteness | Coverage of required information |
| Safety | scoreSafety | Absence of harmful, biased, or dangerous content |
| Cost efficiency | scoreCostEfficiency | Tokens used relative to task complexity |
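The table lists five dimensions but not how they are combined into one score. One plausible approach, shown here purely as an assumption, is a weighted average. The `Dimension` type, the `weightedScore` function, and the weights are hypothetical; only the dimension names come from the table.

```typescript
// Hypothetical sketch: combining the five dimension scores with a weighted average.
type Dimension =
  | "accuracy"
  | "relevance"
  | "completeness"
  | "safety"
  | "costEfficiency";

type DimensionValues = Record<Dimension, number>;

const DIMENSIONS: Dimension[] = [
  "accuracy",
  "relevance",
  "completeness",
  "safety",
  "costEfficiency",
];

function weightedScore(scores: DimensionValues, weights: DimensionValues): number {
  let total = 0;
  let weightSum = 0;
  for (const d of DIMENSIONS) {
    total += scores[d] * weights[d];
    weightSum += weights[d];
  }
  // Avoid dividing by zero when every weight is 0.
  return weightSum === 0 ? 0 : total / weightSum;
}

// Usage: equal weights across all five dimensions.
const equalWeights: DimensionValues = {
  accuracy: 1, relevance: 1, completeness: 1, safety: 1, costEfficiency: 1,
};
const combined = weightedScore(
  { accuracy: 1, relevance: 0.5, completeness: 0.5, safety: 1, costEfficiency: 0 },
  equalWeights,
);
```

Raising the weight on a dimension such as accuracy or safety makes a low score there pull the combined result down more strongly.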

VerificationOptions reference

| Field | Type | Default | Notes |
|-------|------|---------|-------|
| semanticEntropy | boolean | true | Uncertainty estimation via entropy |
| factDecomposition | boolean | true | Decompose and verify individual claims |
| multiSource | boolean | false | Cross-reference multiple sources |
| selfConsistency | boolean | true | Consistency across response variations |
| nli | boolean | true | Natural language inference entailment |
| hallucinationDetection | boolean | false | Dedicated hallucination detection layer |
| hallucinationThreshold | number | 0.10 | Flag score threshold (0-1) |
| passThreshold | number | 0.70 | Overall pass threshold (0-1) |
| riskThreshold | number | 0.50 | Risk score threshold (0-1) |

Pitfalls

withVerification() adds LLM calls: each enabled check costs additional tokens, and multiSource is the most expensive option (disabled by default).
withVerificationStep() is separate from withVerification(): withVerificationStep() adds a reasoning phase after the main loop, while withVerification() adds runtime output checks. The two can be used together.
A passThreshold of 0.7 is conservative; lower it (e.g., to 0.6) for creative tasks where strict factual grounding is not required.
Eval scoring via @reactive-agents/eval uses an LLM judge: the scoring model must be separate from the agent under test for unbiased results.
hallucinationDetection: true adds significant latency, so enable it only for high-stakes outputs.

Enable output verification (hallucination detection, semantic entropy, self-consistency), add post-run verification steps, and run LLM-scored evals across 5 quality dimensions.

9 stars

testing

Updated May 6, 2026

npx skillsauth add tylerjrbuell/reactive-agents-ts quality-assurance

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

| Scanner | Description | Status | Percentage |
|---------|-------------|--------|------------|
| Trivy | Container and dependency vulnerability scanner | Clean | 95% |
| Semgrep | Static code analysis for vulnerabilities | Clean | 95% |
| mcp-scan (Snyk) | Model Context Protocol security validation | Clean | 95% |
| Snyk (dep) | Open source security scanning | Skipped | 50% |
| Socket.dev | Supply chain security analysis | Skipped | 50% |
| VirusTotal | Multi-engine malware detection | Skipped | 50% |
| CrowdStrike | Advanced threat intelligence | Skipped | 50% |
| OSV-Scanner | Open Source Vulnerability database check | Skipped | 50% |
| OWASP Dep-Check | | Skipped | 50% |

Last scanned: May 6, 2026, 2:36 AM (144.6s, 1 file scanned)

SKILL.md

name: quality-assurance
description: Enable output verification (hallucination detection, semantic entropy, self-consistency), add post-run verification steps, and run LLM-scored evals across 5 quality dimensions.
compatibility: Reactive Agents TypeScript projects using @reactive-agents/*
author: reactive-agents
version: 2.0
tier: capability

Related Skills

tylerjrbuell/reactive-agents (development)

Verified, Trusted, Community

Orient to the Reactive Agents framework, understand the builder API shape, and select the right capability skills for your task.

9 stars, SKILL.md, updated Apr 21, 2026

tylerjrbuell/provider-patterns (data-ai)

Verified, Trusted, Community

Configure per-provider behavior, understand streaming quirks, and use the 7-hook adapter system for optimal performance across LLM providers.

9 stars, SKILL.md, updated Apr 21, 2026

tylerjrbuell/memory-patterns (data-ai)

Verified, Trusted, Community

Configure the 4-layer memory system with SQLite/FTS5/vec storage for persistent agent knowledge that survives sessions.

9 stars, SKILL.md, updated Apr 21, 2026

tylerjrbuell/cost-budget-enforcement (testing)

Verified, Trusted, Community

Set per-request, per-session, daily, and monthly spend limits, configure rate limiting and circuit breakers, and isolate costs per user or tenant.

9 stars, SKILL.md, updated Apr 21, 2026

Install manually

# Clone the repo
git clone https://github.com/tylerjrbuell/reactive-agents-ts.git

# Copy into Claude Code skills folder (global)
cp -r reactive-agents-ts/apps/docs/skills/quality-assurance ~/.claude/skills/

Claude Code Skills — official skills path docs.

Repository

tylerjrbuell/reactive-agents-ts

9 stars

Compatible with

Claude Code

OpenAI Codex CLI

ChatGPT