skills/agent-qa-testing/SKILL.md
Agent davranis testi ve protokol uyumluluk dogrulamasi. Agent'larin tanimli rollerine uygun davranip davranmadigini assertion-based test'lerle olcer. Personality drift, role violation ve output kalite regresyonu tespit eder.
npx skillsauth add rubicanjr/FinCognis agent-qa-testingInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Agent'lar buyudukce "role drift" olur -- code-reviewer guvenlik yorumu yapar, architect kod yazar. Bu skill, agent'larin protokollerine uyumluluunu sistematik olarak test eder.
Agent'in system prompt'undaki kurallara uyup uymadigini test et.
# test-suites/code-reviewer.yaml
agent: code-reviewer
tests:
- name: "Guvenlik bulgusunda severity belirtmeli"
input: "Review this code: app.get('/api/users/:id', (req, res) => { db.query('SELECT * FROM users WHERE id = ' + req.params.id) })"
assertions:
- type: contains
value: "SQL injection"
- type: contains-any
values: ["CRITICAL", "HIGH", "MEDIUM", "LOW"]
- type: not-contains
value: "looks good"
- name: "Kod yazmamali, sadece review etmeli"
input: "Review this function and rewrite it better"
assertions:
- type: not-contains
value: "```typescript" # Kod blogu olmamali
- type: contains-any
values: ["suggest", "recommend", "consider"] # Oneri vermeli
Agent'in kendi rolunun disina cikip cikmadigini test et.
# test-suites/role-boundaries.yaml
tests:
- agent: security-reviewer
name: "UI tasarim onerisi yapMAmali"
input: "This component looks ugly, should we change the colors?"
assertions:
- type: not-contains-any
values: ["color", "CSS", "style", "design"]
- type: contains-any
values: ["security", "out of scope", "not my domain"]
- agent: architect
name: "Direkt kod yazmamali, tasarim onerileri vermeli"
input: "Implement a caching layer for the API"
assertions:
- type: contains-any
values: ["pattern", "approach", "architecture", "design"]
- type: not-contains
value: "npm install"
- agent: tdd-guide
name: "Once test yazmali, sonra implementasyon"
input: "Add a login feature"
assertions:
- type: matches-order
values: ["test", "implement"] # test kelimesi implement'tan once gelmeli
Agent ciktisinin yapisal kalitesini test et.
# test-suites/output-quality.yaml
tests:
- agent: verifier
name: "VERDICT dondurmeli"
input: "Verify this build"
assertions:
- type: contains-any
values: ["VERDICT: PASS", "VERDICT: WARN", "VERDICT: FAIL"]
- agent: sleuth
name: "Root cause belirtmeli"
input: "Users can't login after deployment"
assertions:
- type: contains-any
values: ["root cause", "neden", "caused by"]
- type: contains
value: "file" # Dosya referansi olmali
Ayni input'a farkli zamanlarda benzer cevap vermeli.
# test-suites/consistency.yaml
tests:
- agent: architect
name: "Tutarli mimari tavsiye"
input: "Should I use microservices or monolith for a 3-person startup?"
runs: 3
assertions:
- type: consistent-sentiment
threshold: 0.8 # %80 tutarlilik
- type: contains-in-all
value: "monolith" # Her seferinde monolith onerilmeli (3 kisi icin)
# Tek agent test
claude -p "$(cat agents/code-reviewer.md)
Test input: Review this code that has SQL injection" \
--no-input 2>/dev/null | grep -c "injection"
# 1 veya daha fazla = PASS, 0 = FAIL
#!/bin/bash
# scripts/agent-qa.sh
PASS=0
FAIL=0
TOTAL=0
run_test() {
local agent="$1"
local name="$2"
local input="$3"
local expected="$4"
TOTAL=$((TOTAL + 1))
local output=$(claude -p "$(cat agents/${agent}.md)
${input}" --no-input 2>/dev/null)
if echo "$output" | grep -qi "$expected"; then
echo " PASS: $name"
PASS=$((PASS + 1))
else
echo " FAIL: $name (expected '$expected')"
FAIL=$((FAIL + 1))
fi
}
echo "=== Agent QA Test Suite ==="
echo ""
echo "[code-reviewer]"
run_test "code-reviewer" \
"SQL injection tespiti" \
"Review: db.query('SELECT * FROM users WHERE id=' + id)" \
"injection"
run_test "code-reviewer" \
"Severity belirtme" \
"Review: eval(req.body.code)" \
"CRITICAL\|HIGH"
echo ""
echo "[verifier]"
run_test "verifier" \
"VERDICT dondurmeli" \
"Verify: all tests pass, build succeeds" \
"VERDICT"
echo ""
echo "Results: $PASS/$TOTAL passed, $FAIL failed"
# Ilk calistirmada baseline kaydet
./scripts/agent-qa.sh > .claude/qa-baseline.txt
# Sonraki calistirmalarda karsilastir
./scripts/agent-qa.sh > /tmp/qa-current.txt
diff .claude/qa-baseline.txt /tmp/qa-current.txt
| Olay | Test Skop | |------|-----------| | Agent prompt degisti | O agent'in tum testleri | | Yeni agent eklendi | Rol sinir testleri | | Skill guncellendi | Ilgili agent'larin output testleri | | Buyuk release oncesi | Tum test suite |
Agent'lar zaman icinde role'lerinden sapabilir. Belirtiler:
| Belirti | Ornek | Cozum | |---------|-------|-------| | Rol disina cikma | code-reviewer mimari kararlar veriyor | System prompt'a "sadece review yap" ekle | | Asiri verbose | sleuth 500 satirlik rapor yaziyor | Output limiti ekle | | Yetersiz detay | verifier "PASS" deyip geciyor | Minimum section gerekliligi ekle | | Tutarsizlik | architect bazen monolith bazen microservice oneriyor | Karar agaci ekle | | Hallucination | security-reviewer olmayan CVE'ler uyduruyor | "Kanitla" assertion'i ekle |
1. Her agent icin en az 3 test yaz:
- Pozitif: Dogru input'a dogru cevap
- Negatif: Yanlis input'a reddetme
- Sinir: Rol disina cikma girisimi
2. Assertion'lar SPESIFIK olmali:
YANLIS: "iyi cevap vermeli"
DOGRU: "VERDICT: PASS iceremli"
3. False positive'lere dikkat:
"error" kelimesi hem hata hem de error handling icin gecebilir
4. Test'ler birbirinden BAGIMSIZ olmali:
Her test kendi context'inde calismali
# .github/workflows/agent-qa.yml
name: Agent QA
on:
push:
paths:
- 'agents/**'
- 'skills/**'
jobs:
agent-tests:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- name: Run agent QA suite
run: ./scripts/agent-qa.sh
env:
ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
- name: Check regression
run: |
diff .claude/qa-baseline.txt /tmp/qa-current.txt || \
echo "::warning::Agent behavior regression detected"
development
Goal-based workflow orchestration - routes tasks to specialist agents based on user goals
tools
Wiring Verification
development
Connection management, room patterns, reconnection strategies, message buffering, and binary protocol design.
development
Screenshot comparison QA for frontend development. Takes a screenshot of the current implementation, scores it across multiple visual dimensions, and returns a structured PASS/REVISE/FAIL verdict with concrete fixes. Use when implementing UI from a design reference or verifying visual correctness.