packages/skills/skills/advanced-evaluation/SKILL.md
# Advanced Evaluation

Production-grade techniques for evaluating LLM outputs using LLM-as-judge approaches with bias mitigation.

## Prerequisites

- Understanding of evaluation metrics
- Access to LLM APIs for judge models

## Instructions

### Core Approaches

**Direct Scoring**: Single LLM rates one response on a defined scale.

- Best for: Objective criteria (factual accuracy, instruction following)
- Requires: Clear criteria, calibrated scale, chain-of-thought justification
`npx skillsauth add mediar-ai/skillhubz packages/skills/skills/advanced-evaluation`

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
**Pairwise Comparison**: LLM compares two responses and selects the better one.
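Direct Scoring depends on clear criteria, a calibrated scale, and a chain-of-thought justification. A minimal sketch of how a judge prompt and a score parser could look is shown here. The function names, the 1 to 5 default scale, and the `SCORE:` output convention are illustrative assumptions, not part of this skill.

```python
import re
from typing import Optional


def build_direct_scoring_prompt(
    response: str, criterion: str, scale_min: int = 1, scale_max: int = 5
) -> str:
    """Build a judge prompt (hypothetical format) for one response and one criterion."""
    return (
        f"Evaluate the response below on this criterion: {criterion}.\n"
        f"Use a scale from {scale_min} (worst) to {scale_max} (best).\n"
        # Ask for reasoning first so the score is justified, not guessed.
        "First explain your reasoning step by step, then finish with a line "
        "of the form 'SCORE: <integer>'.\n\n"
        f"Response:\n{response}\n"
    )


def parse_score(
    judge_output: str, scale_min: int = 1, scale_max: int = 5
) -> Optional[int]:
    """Return the last in-range 'SCORE: n' value, or None if absent or out of range."""
    matches = re.findall(r"SCORE:\s*(-?\d+)", judge_output)
    if not matches:
        return None
    score = int(matches[-1])
    return score if scale_min <= score <= scale_max else None
```

The prompt string would be sent to whichever judge model is in use. Returning `None` for missing or out-of-range scores lets the caller retry or discard the judgment rather than record a bad value.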
| Bias | Mitigation |
|------|------------|
| Position Bias | Evaluate twice with swapped positions |
| Length Bias | Explicit prompting to ignore length |
| Self-Enhancement | Use different models for generation and evaluation |
| Verbosity Bias | Criteria-specific rubrics |
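The position-bias mitigation (evaluating twice with swapped positions) can be sketched as follows. The `judge` callable is a hypothetical stand-in for a real judge-model call. Treating disagreement between the two passes as a tie is an assumption of this sketch, not something the table specifies.

```python
from typing import Callable

# Hypothetical judge signature: judge(prompt, first, second) -> "first" or "second".
Judge = Callable[[str, str, str], str]


def pairwise_with_swap(
    judge: Judge, prompt: str, response_1: str, response_2: str
) -> str:
    """Run the comparison in both orders and keep only a consistent verdict."""
    first_pass = judge(prompt, response_1, response_2)
    second_pass = judge(prompt, response_2, response_1)
    for verdict in (first_pass, second_pass):
        if verdict not in ("first", "second"):
            raise ValueError(f"unexpected judge verdict: {verdict!r}")

    # Map each positional verdict back to the identity of the response.
    winner_a = "response_1" if first_pass == "first" else "response_2"
    winner_b = "response_2" if second_pass == "first" else "response_1"

    # Assumption: disagreement across orderings signals position bias, so report a tie.
    return winner_a if winner_a == winner_b else "tie"
```

A judge that always prefers whichever response is shown first yields `"tie"` under this scheme, which is the case the swap is meant to expose.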
Is there objective ground truth?
├── Yes → Direct Scoring (factual accuracy, format compliance)
└── No → Is it preference/quality judgment?
    ├── Yes → Pairwise Comparison (tone, creativity)
    └── No → Reference-based evaluation
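The decision tree above can be expressed as a small selector function. This is a sketch only: the function name, parameter names, and returned labels are hypothetical and chosen for illustration.

```python
def choose_evaluation_method(
    has_ground_truth: bool, is_preference_judgment: bool
) -> str:
    """Mirror the decision tree: ground truth first, then preference, else reference-based."""
    if has_ground_truth:
        # Objective criteria such as factual accuracy or format compliance.
        return "direct_scoring"
    if is_preference_judgment:
        # Subjective qualities such as tone or creativity.
        return "pairwise_comparison"
    return "reference_based"
```

For example, `choose_evaluation_method(False, True)` selects pairwise comparison, matching the tone and creativity branch.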
Source: muratcankoylan/Agent-Skills-for-Context-Engineering