
mediar-ai/packages/skills/skills/advanced-evaluation

Name: packages/skills/skills/advanced-evaluation
Author: mediar-ai

packages/skills/skills/advanced-evaluation/SKILL.md

npx skillsauth add mediar-ai/skillhubz packages/skills/skills/advanced-evaluation


Advanced Evaluation

Production-grade techniques for evaluating LLM outputs using LLM-as-judge approaches with bias mitigation.

Prerequisites

Understanding of evaluation metrics
Access to LLM APIs for judge models

Instructions

Core Approaches

Direct Scoring: Single LLM rates one response on a defined scale.

Best for: Objective criteria (factual accuracy, instruction following)
Requires: Clear criteria, calibrated scale, chain-of-thought justification
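A minimal sketch of direct scoring, assuming the judge model is reachable through a plain `judge(prompt) -> str` callable. That callable, the JSON reply format, and the function names are illustrative assumptions, not part of the skill. The prompt asks for the justification before the score, and the parser rejects scores outside the scale.

```python
import json
import re
from typing import Callable


def build_direct_scoring_prompt(response: str, criterion: str,
                                scale_min: int = 1, scale_max: int = 5) -> str:
    """Build a prompt that asks for reasoning first, then a score."""
    return (
        f"Evaluate the response below on this criterion: {criterion}\n"
        f"Scale: {scale_min} (worst) to {scale_max} (best).\n"
        "First explain your reasoning, then give the score.\n"
        'Reply as JSON: {"justification": "...", "score": <integer>}\n\n'
        f"Response:\n{response}"
    )


def parse_direct_score(raw: str, scale_min: int = 1, scale_max: int = 5) -> dict:
    """Extract the JSON object from the judge reply and validate the score."""
    match = re.search(r"\{.*\}", raw, re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in judge reply")
    data = json.loads(match.group(0))
    score = int(data["score"])
    if not scale_min <= score <= scale_max:
        raise ValueError(f"score {score} outside {scale_min}-{scale_max}")
    return {"justification": str(data.get("justification", "")), "score": score}


def direct_score(judge: Callable[[str], str], response: str, criterion: str) -> dict:
    """Run one direct-scoring evaluation with a hypothetical judge callable."""
    return parse_direct_score(judge(build_direct_scoring_prompt(response, criterion)))
```

Keeping prompt construction and parsing separate makes it possible to test the parser against canned judge replies without calling a model.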

Pairwise Comparison: LLM compares two responses and selects the better one.

Best for: Subjective preferences (tone, style, persuasiveness)
Requires: Position bias mitigation (swap positions and check consistency)

Bias Mitigation

| Bias | Mitigation |
|------|------------|
| Position Bias | Evaluate twice with swapped positions |
| Length Bias | Explicit prompting to ignore length |
| Self-Enhancement | Use different models for generation and evaluation |
| Verbosity Bias | Criteria-specific rubrics |

Pairwise Comparison Protocol

First pass: Response A first, Response B second
Second pass: Response B first, Response A second
Consistency check: If passes disagree, return TIE
Final verdict: Consistent winner with averaged confidence
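The four steps above can be sketched as follows. The judge signature is a hypothetical one chosen for illustration: it takes the two responses in display order and returns which position it prefers (`"first"` or `"second"`) plus a confidence in [0, 1]. Returning `None` as the confidence for a tie is also an assumption, since the protocol does not say what to report in that case.

```python
from typing import Callable, Tuple

# Hypothetical judge signature: (first_response, second_response) ->
# ("first" or "second", confidence between 0 and 1).
Judge = Callable[[str, str], Tuple[str, float]]


def pairwise_compare(judge: Judge, response_a: str, response_b: str) -> dict:
    """Compare two responses with position swapping and a consistency check."""
    # First pass: Response A is shown first, Response B second.
    pick_1, conf_1 = judge(response_a, response_b)
    # Second pass: Response B is shown first, Response A second.
    pick_2, conf_2 = judge(response_b, response_a)

    # Map positional picks back to the underlying responses.
    winner_1 = "A" if pick_1 == "first" else "B"
    winner_2 = "B" if pick_2 == "first" else "A"

    # Consistency check: disagreement between passes yields a tie.
    if winner_1 != winner_2:
        return {"winner": "TIE", "confidence": None}

    # Final verdict: consistent winner with averaged confidence.
    return {"winner": winner_1, "confidence": (conf_1 + conf_2) / 2}
```

A judge that always prefers whichever response appears first produces a tie here, which is the behavior the swap is meant to expose.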

Rubric Components

Level descriptions: Clear boundaries for each score
Characteristics: Observable features per level
Examples: Representative text (optional but valuable)
Edge cases: Guidance for ambiguous situations
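One way to hold these components is a small data structure that renders to prompt text. The class and field names below are assumptions made for this sketch. Only the four components themselves come from the list above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class RubricLevel:
    score: int
    description: str              # clear boundary for this score
    characteristics: List[str]    # observable features at this level
    examples: List[str] = field(default_factory=list)  # optional but valuable


@dataclass
class Rubric:
    criterion: str
    levels: List[RubricLevel]
    edge_cases: List[str] = field(default_factory=list)  # guidance for ambiguity

    def render(self) -> str:
        """Render the rubric as plain text for inclusion in a judge prompt."""
        lines = [f"Criterion: {self.criterion}"]
        for level in sorted(self.levels, key=lambda lvl: lvl.score):
            lines.append(f"Score {level.score}: {level.description}")
            lines.extend(f"  - {c}" for c in level.characteristics)
            lines.extend(f"  Example: {e}" for e in level.examples)
        if self.edge_cases:
            lines.append("Edge cases:")
            lines.extend(f"  - {e}" for e in self.edge_cases)
        return "\n".join(lines)
```

Rendering levels in ascending score order keeps the prompt stable regardless of the order in which levels were defined.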

Decision Framework

Is there objective ground truth?
├── Yes → Direct Scoring (factual accuracy, format compliance)
└── No → Is it preference/quality judgment?
    ├── Yes → Pairwise Comparison (tone, creativity)
    └── No → Reference-based evaluation
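The tree can also be written as a small function. The function name, parameter names, and returned labels are illustrative assumptions. The branching mirrors the tree exactly.

```python
def choose_method(has_ground_truth: bool, is_preference_judgment: bool) -> str:
    """Pick an evaluation approach by following the decision tree."""
    if has_ground_truth:
        # Objective criteria such as factual accuracy or format compliance.
        return "direct_scoring"
    if is_preference_judgment:
        # Subjective criteria such as tone or creativity.
        return "pairwise_comparison"
    return "reference_based"
```

The second argument is only consulted when there is no objective ground truth, matching the order of the questions in the tree.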

Guidelines

Always require justification before scores (15-25% reliability improvement)
Always swap positions in pairwise comparison
Match scale granularity to rubric specificity
Separate objective and subjective criteria
Include confidence scores calibrated to evidence strength

Notes

Chain-of-thought prompting improves evaluation reliability
Single-pass pairwise comparison is corrupted by position bias
Validate automated evaluation against human judgments
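The skill does not say how to validate against human judgments. One common option, used here as an assumption, is Cohen's kappa, which measures agreement between the judge's labels and human labels after correcting for agreement expected by chance. A minimal standard-library sketch:

```python
from collections import Counter
from typing import Sequence


def cohen_kappa(judge_labels: Sequence[str], human_labels: Sequence[str]) -> float:
    """Chance-corrected agreement between judge and human labels."""
    if len(judge_labels) != len(human_labels) or not judge_labels:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(judge_labels)

    # Fraction of items where the judge and the human gave the same label.
    observed = sum(j == h for j, h in zip(judge_labels, human_labels)) / n

    # Agreement expected if both raters labeled independently at their own rates.
    judge_counts = Counter(judge_labels)
    human_counts = Counter(human_labels)
    expected = sum(judge_counts[k] * human_counts[k] for k in judge_counts) / (n * n)

    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)
```

Values near 1 indicate strong agreement and values near 0 indicate agreement no better than chance. The labels can be pairwise verdicts such as `"A"`, `"B"`, and `"TIE"`, or discrete scores converted to strings.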

Source: muratcankoylan/Agent-Skills-for-Context-Engineering


4 stars

development

Updated Apr 17, 2026


npx skillsauth add mediar-ai/skillhubz packages/skills/skills/advanced-evaluation

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

| Scanner | Description | Status | Reported % |
|---------|-------------|--------|------------|
| Trivy | Container and dependency vulnerability scanner | Clean | 95% |
| Semgrep | Static code analysis for vulnerabilities | Clean | 95% |
| mcp-scan (Snyk) | Model Context Protocol security validation | Clean | 95% |
| Snyk (dep) | Open source security scanning | Skipped | 50% |
| Socket.dev | Supply chain security analysis | Skipped | 50% |
| VirusTotal | Multi-engine malware detection | Skipped | 50% |
| CrowdStrike | Advanced threat intelligence | Skipped | 50% |
| OSV-Scanner | Open Source Vulnerability database check | Skipped | 50% |
| OWASP Dep-Check | | Skipped | 50% |

Last scanned: Apr 24, 2026, 9:02 PM (1.8s, 1 file scanned)




Install manually


# Clone the repo
git clone https://github.com/mediar-ai/skillhubz.git

# Make sure the global Claude Code skills folder exists
mkdir -p ~/.claude/skills

# Copy into Claude Code skills folder (global)
cp -r skillhubz/packages/skills/skills/advanced-evaluation ~/.claude/skills/

See the official Claude Code Skills documentation for the skills path.

Repository

mediar-ai/skillhubz

4 stars

Compatible with

Claude Code

OpenAI Codex CLI

ChatGPT