strongorange/hitl-eval-dashboard

Name: hitl-eval-dashboard
Author: strongorange

skills/hitl-eval-dashboard/SKILL.md

npx skillsauth add strongorange/strong-orange-agent-skills hitl-eval-dashboard


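HITL Evaluation Dashboard

This is a pattern for structuring evaluations so that things you can only judge by looking at them directly are captured as a reproducible record of judgments.

The core is simplicity: a UI that loads quickly in the browser, human score and comment input, local file storage, and a re-runnable aggregate report.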

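Step 1: Discovery (read first)

Do not start writing code right after reading this SKILL.

Read the project's existing run, build, and script conventions (package.json, Makefile, pyproject.toml, etc.).
Identify the form of the artifacts to be compared.

| Artifact type | Examples |
|---|---|
| Visual | screenshot, HTML, canvas output, UI screen |
| Text/structured | LLM output, JSON diff, logs, API responses |
| Composite | screenshot + meta JSON + interview transcript |

If the project already has an evaluation schema, QA checklist, or benchmark format, that takes priority. This SKILL's conventions do not replace existing rules.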

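Step 2: Define the evaluation unit

Briefly settle the following five items and show them to the user before moving on to implementation.

Evaluation target:   (e.g. "gemini A/B generated HTML output")
Comparison unit:     (e.g. seed × run index, scenario ID, variant ID)
Evaluation axes:     (e.g. intent fidelity / specificity / no editing needed, 3 to 5 axes)
Storage location:    (e.g. runs/<candidateId>/scores.json)
Aggregation rule:    (e.g. B - A >= 0.5 across all axes → PASS)

If no criteria exist yet, propose them. Implement only after the user approves.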
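
The five items above could be held in a small typed record so that the approval gate is explicit before any implementation starts. This is a minimal sketch: the `EvaluationPlan` type, its field names, and the `planBlockers` helper are illustrative assumptions, not part of the skill.

```typescript
// Hypothetical record of the five Step 2 decisions (all names are assumptions).
interface EvaluationPlan {
  target: string;          // what is being evaluated
  comparisonUnit: string;  // e.g. "seed × run index"
  axes: string[];          // evaluation axes, e.g. 3 to 5 of them
  storagePath: string;     // e.g. "runs/<candidateId>/scores.json"
  passCriterion: string;   // e.g. "B - A >= 0.5 across all axes"
  approvedByUser: boolean; // implementation starts only after approval
}

// Lists what still blocks implementation; an empty array means the plan is ready.
function planBlockers(plan: EvaluationPlan): string[] {
  const blockers: string[] = [];
  const textFields: Array<[string, string]> = [
    ["target", plan.target],
    ["comparisonUnit", plan.comparisonUnit],
    ["storagePath", plan.storagePath],
    ["passCriterion", plan.passCriterion],
  ];
  for (const [name, value] of textFields) {
    if (value.trim() === "") blockers.push(`${name} is not defined`);
  }
  if (plan.axes.length === 0) blockers.push("no evaluation axes defined");
  if (!plan.approvedByUser) blockers.push("user has not approved the plan");
  return blockers;
}
```

A non-empty result is a prompt to go back to the user (or to propose a criterion), not to start coding.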

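Step 3: Evaluation model

The default is A/B pair evaluation. Otherwise, choose one of the alternatives below.

| Mode | When to use |
|---|---|
| A/B pair | Comparing baseline vs candidate |
| Single-shot | Absolute quality judgment (pass/fail, severity) |
| Ranking | Ranking three or more candidates |
| Checklist | Whether checklist items are satisfied |

Axis scores start on a 1-5 scale. If needed, swap in winner selection, binary, severity, or confidence.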
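
For the default 1-5 axis scale, a small check before saving can catch missing or out-of-range scores. This is a sketch under assumptions: `invalidAxes` is a hypothetical helper, and requiring integer scores is an assumption based on radio/select input, not something the skill mandates.

```typescript
// Returns the axes whose score is missing, non-integer, or outside [min, max].
// Defaults reflect the 1-5 starting scale; pass other bounds for a different scale.
function invalidAxes(
  scores: Record<string, number>,
  axes: string[],
  min: number = 1,
  max: number = 5
): string[] {
  return axes.filter((axis) => {
    const value = scores[axis];
    return (
      typeof value !== "number" ||
      !Number.isInteger(value) ||
      value < min ||
      value > max
    );
  });
}
```

The UI could call this on "Save & Next" and keep the record in `draft` status while the returned list is non-empty.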

Step 4: Score Schema (minimal envelope)

The storage format follows an envelope + project payload structure.

envelope (fixed across all projects):

interface HitlEnvelope {
  schemaVersion: "1";        // schema version
  evaluationId: string;      // evaluation unit identifier (defined by the project)
  variantIds: string[];      // variants being compared (e.g. ["tsA", "tsB"])
  status: "draft" | "scored" | "skipped";
  evidenceRefs: string[];    // paths to evidence files (screenshot, html, etc.)
  reviewerNote?: string;
  savedAt: string;           // ISO8601
  payload: unknown;          // project-specific scores and comments
}

payload (defined by the project):

// Example: A/B axes score payload
interface AbScorePayload {
  axesA: Record<string, number>;
  axesB: Record<string, number>;
  comment: string;
}
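
Because `scores.json` is read back from disk, a runtime check that parsed JSON actually matches the envelope can be useful. The guard below is a minimal sketch: `isHitlEnvelope` is a hypothetical helper, it checks only the envelope fields, and it deliberately leaves `payload` unvalidated since that part is project-defined.

```typescript
// Envelope shape as defined in Step 4.
interface HitlEnvelope {
  schemaVersion: "1";
  evaluationId: string;
  variantIds: string[];
  status: "draft" | "scored" | "skipped";
  evidenceRefs: string[];
  reviewerNote?: string;
  savedAt: string;
  payload: unknown;
}

const ENVELOPE_STATUSES: string[] = ["draft", "scored", "skipped"];

function isStringArray(value: unknown): boolean {
  return Array.isArray(value) && value.every((item) => typeof item === "string");
}

// Type guard for values parsed from JSON; does not inspect the payload contents.
function isHitlEnvelope(value: unknown): value is HitlEnvelope {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  const status = v["status"];
  const note = v["reviewerNote"];
  return (
    v["schemaVersion"] === "1" &&
    typeof v["evaluationId"] === "string" &&
    isStringArray(v["variantIds"]) &&
    typeof status === "string" &&
    ENVELOPE_STATUSES.indexOf(status) !== -1 &&
    isStringArray(v["evidenceRefs"]) &&
    (note === undefined || typeof note === "string") &&
    typeof v["savedAt"] === "string" &&
    "payload" in v
  );
}
```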

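Rules:

Keep original artifacts (screenshot, html, log, etc.) separate from scores, in their own directory.
Saving the same evaluation unit again is an idempotent upsert (remove the duplicate, then append).
Never lose raw scores. It must be possible to apply a different aggregation rule later.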
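
The upsert rule above can be sketched as a pure function over the envelope array. It assumes `evaluationId` alone identifies an evaluation unit; a project that also keys on reviewer or variant set would need a wider key. `upsertEnvelope` is a hypothetical name.

```typescript
// Envelope shape as defined in Step 4.
interface HitlEnvelope {
  schemaVersion: "1";
  evaluationId: string;
  variantIds: string[];
  status: "draft" | "scored" | "skipped";
  evidenceRefs: string[];
  reviewerNote?: string;
  savedAt: string;
  payload: unknown;
}

// Drops any earlier record with the same evaluationId, then appends the new one.
// Returns a new array; the input array is not mutated.
function upsertEnvelope(existing: HitlEnvelope[], next: HitlEnvelope): HitlEnvelope[] {
  const others = existing.filter((e) => e.evaluationId !== next.evaluationId);
  return others.concat([next]);
}
```

Saving the same unit twice leaves exactly one record for it, so re-scoring after a Prev/Next round trip does not inflate the evaluation count in the aggregate.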

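Step 5: Evaluation UI design

Keep the UI as simple as possible.

Required elements:

Current pair info (which seed/scenario, which run)
Comparison view (side by side, or tabs)
Detailed original view (opens in a new tab)
Evaluation axis input (radio or select)
Comment (optional)
Save & Next button
Progress indicator (n/total)
Prev/Next navigation (prefilled with previous scores)

Technology choices:

Standalone PoC: Node built-in http static server + vanilla HTML/CSS/JS (no build step)
Existing React/Next.js project: follow the existing routing and component conventions
Local only, so no authentication is needed

Minimal API design:

GET  /api/manifests       → list of available run sets
GET  /api/pair?...        → artifact, meta, interview, etc. for the current pair
GET  /api/screenshot?...  → image binary
GET  /api/scores?...      → existing scores (for prefill)
POST /api/score           → save a score (upsert)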
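
The five endpoints above could be dispatched with a plain lookup table, which keeps the server free of framework dependencies. This is a sketch only: the `RouteName` labels and `matchRoute` helper are hypothetical, and the query parameters (elided as `?...` in the design) are intentionally not parsed here.

```typescript
type RouteName = "manifests" | "pair" | "screenshot" | "scores" | "saveScore";

// Method + path pairs taken from the minimal API design.
const ROUTES: Record<string, RouteName> = {
  "GET /api/manifests": "manifests",
  "GET /api/pair": "pair",
  "GET /api/screenshot": "screenshot",
  "GET /api/scores": "scores",
  "POST /api/score": "saveScore",
};

// Resolves an HTTP method and raw request URL to a route name, or null (respond 404).
function matchRoute(method: string, rawUrl: string): RouteName | null {
  const queryStart = rawUrl.indexOf("?");
  const path = queryStart === -1 ? rawUrl : rawUrl.slice(0, queryStart);
  const hit = ROUTES[`${method.toUpperCase()} ${path}`];
  return hit === undefined ? null : hit;
}
```

In a Node `http` server, the request handler would pass `req.method` and `req.url` to `matchRoute` and fall back to static file serving from `public/` when it returns null.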

Step 6: Aggregate & Report

Close out aggregation with a single re-runnable command.

# Examples
npm run aggregate
pnpm aggregate
python scripts/aggregate.py

Reports default to Markdown:

Meta: generation time, A set/B set IDs, number of evaluations, pass criterion
Overall mean table: axis × A/B/Δ
Verdict: PASS / FAIL
Breakdown by seed or scenario
Comment list
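
The overall mean table and verdict could be computed as below. This is a sketch using the example `AbScorePayload` from Step 4 and the example criterion from Step 2 (B - A >= 0.5 on every axis); the function names, the 0.5 default, and the table layout are assumptions, and the per-seed breakdown and comment list are omitted.

```typescript
interface AbScorePayload {
  axesA: Record<string, number>;
  axesB: Record<string, number>;
  comment: string;
}

interface AxisSummary {
  axis: string;
  meanA: number;
  meanB: number;
  delta: number; // meanB - meanA
}

// Mean score for one axis; records that lack the axis are skipped (NaN if none have it).
function axisMean(scores: Array<Record<string, number>>, axis: string): number {
  let sum = 0;
  let count = 0;
  for (const s of scores) {
    const value = s[axis];
    if (typeof value === "number") {
      sum += value;
      count += 1;
    }
  }
  return count === 0 ? NaN : sum / count;
}

function summarize(payloads: AbScorePayload[], axes: string[]): AxisSummary[] {
  return axes.map((axis) => {
    const meanA = axisMean(payloads.map((p) => p.axesA), axis);
    const meanB = axisMean(payloads.map((p) => p.axesB), axis);
    return { axis, meanA, meanB, delta: meanB - meanA };
  });
}

// PASS only if every axis improves by at least `threshold`; no data means FAIL.
function verdict(rows: AxisSummary[], threshold: number = 0.5): "PASS" | "FAIL" {
  const ok = rows.length > 0 && rows.every((r) => r.delta >= threshold);
  return ok ? "PASS" : "FAIL";
}

// Renders the axis × A/B/Δ table followed by the verdict line.
function toMarkdown(rows: AxisSummary[], result: "PASS" | "FAIL"): string {
  const lines = ["| Axis | A | B | Δ |", "|---|---|---|---|"];
  for (const r of rows) {
    lines.push(
      `| ${r.axis} | ${r.meanA.toFixed(2)} | ${r.meanB.toFixed(2)} | ${r.delta.toFixed(2)} |`
    );
  }
  lines.push("", `Verdict: ${result}`);
  return lines.join("\n");
}
```

Because the function works from raw per-axis scores rather than a stored verdict, the same `scores.json` can be re-aggregated later under a different threshold.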

Step 7: Implementation

Pre-implementation checklist:

[ ] Have the five items from Step 2 been approved by the user?
[ ] Did you check whether the project already has an evaluation schema/format?
[ ] Did you check the existing package manager and file structure?

Example standalone PoC directory:

eval-dashboard/
├── src/
│   ├── server.ts          # Node http server
│   └── aggregate.ts       # aggregation script
├── public/
│   ├── index.html
│   ├── app.js
│   └── style.css
├── runs/
│   └── <candidateId>/
│       ├── scores.json    # envelope array
│       └── <seed>/<variant>/<runIdx>/
│           ├── screenshot.png
│           ├── html.html
│           └── meta.json
└── package.json

For concrete implementation patterns, see references/poc-dashboard-pattern.md.
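
The example layout under `runs/` can be expressed as path helpers so the server and the aggregate script agree on locations. This is a sketch under assumptions: `RunRef`, `safeSegment`, `scoresPath`, and `artifactPaths` are hypothetical names, and the segment check is an added precaution for values that arrive via query parameters, not something the skill prescribes.

```typescript
interface RunRef {
  candidateId: string;
  seed: string;
  variant: string;
  runIdx: number;
}

// Rejects values that could escape the runs/ directory when used as a path segment.
function safeSegment(value: string): string {
  if (value === "" || value === "." || value === ".." || /[\\\/]/.test(value)) {
    throw new Error(`unsafe path segment: ${JSON.stringify(value)}`);
  }
  return value;
}

// runs/<candidateId>/scores.json
function scoresPath(candidateId: string): string {
  return `runs/${safeSegment(candidateId)}/scores.json`;
}

// runs/<candidateId>/<seed>/<variant>/<runIdx>/{screenshot.png,html.html,meta.json}
function artifactPaths(ref: RunRef): { screenshot: string; html: string; meta: string } {
  const dir = [ref.candidateId, ref.seed, ref.variant, String(ref.runIdx)]
    .map((segment) => safeSegment(segment))
    .join("/");
  return {
    screenshot: `runs/${dir}/screenshot.png`,
    html: `runs/${dir}/html.html`,
    meta: `runs/${dir}/meta.json`,
  };
}
```

The paths returned by `artifactPaths` are also natural values for an envelope's `evidenceRefs`.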

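Anti-Patterns

Forcing every evaluation into a single strict global schema
Reducing screenshots, UX notes, and LLM output to numeric scores alone
Hardcoding a specific project's rubric, file names, or directory structure into a shared SKILL
Closing out HITL with a mechanical threshold alone, such as "pass if score >= N"
Saving scores without evidence references (the reason for a score cannot be traced later)
Overwriting or deleting original artifacts mid-evaluation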

References

Evaluation design template: references/evaluation-design-template.md
Score envelope examples: references/score-envelope-examples.md
Minimal dashboard pattern: references/poc-dashboard-pattern.md


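Guides the design and implementation of a lightweight HITL evaluation loop for A/B, before-after, model, prompt, UX, and design quality comparisons that require subjective human judgment. Invoke it first in the following situations:
- The user mentions "you have to see it to judge", "A/B comparison", "before/after", "quality evaluation", "recording scores", "evaluation dashboard", "HITL", or "create a report"
- Automated tests alone are not enough to judge quality and human judgment is needed
- AI outputs, generated designs, prompt changes, pipeline changes, or UX candidates are being compared
- Evaluation results need to be aggregated or resumed later

This is a guide, not an implementation scaffold. Do not generate code right away; confirm the design first.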

data-ai

Updated Apr 27, 2026


npx skillsauth add strongorange/strong-orange-agent-skills hitl-eval-dashboard

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean. Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

| Scanner | Description | Status | Percentage |
|---|---|---|---|
| Trivy | Container and dependency vulnerability scanner | Clean | 95% |
| Semgrep | Static code analysis for vulnerabilities | Clean | 95% |
| mcp-scan (Snyk) | Model Context Protocol security validation | Clean | 95% |
| Snyk (dep) | Open source security scanning | Skipped | 50% |
| Socket.dev | Supply chain security analysis | Skipped | 50% |
| VirusTotal | Multi-engine malware detection | Skipped | 50% |
| CrowdStrike | Advanced threat intelligence | Skipped | 50% |
| OSV-Scanner | Open Source Vulnerability database check | Skipped | 50% |
| OWASP Dep-Check | | Skipped | 50% |

Last scanned: Apr 27, 2026, 6:31 AM (64.2s, 4 files scanned)




Install manually


# Clone the repo
git clone https://github.com/strongorange/strong-orange-agent-skills.git

# Copy into Claude Code skills folder (global)
cp -r strong-orange-agent-skills/skills/hitl-eval-dashboard ~/.claude/skills/


Repository: strongorange/strong-orange-agent-skills

Compatible with: Claude Code, OpenAI Codex CLI, ChatGPT