skills/apex-agents/SKILL.md
Design and execute long-horizon, cross-application agent workflows for professional knowledge work (finance, consulting, legal). Applies the APEX-Agents benchmark methodology to structure multi-step tasks that span files, spreadsheets, documents, email, calendars, and code execution within realistic work environments.

Trigger phrases:
- "Build an agent workflow for this banking/consulting/legal task"
- "Create a cross-application task pipeline"
- "Design a multi-step professional workflow with rubric evaluation"
- "Set up an Archipelago-style sandboxed agent environment"
- "Evaluate agent performance on a long-horizon task"
- "Break this professional task into rubric-graded criteria"

Install this skill globally with one command: `npx skillsauth add ndpvt-web/arxiv-claude-skills apex-agents`. Works with Claude Code, Cursor, and Windsurf.
This skill enables Claude to design, decompose, and execute complex professional knowledge-work tasks that span multiple applications and file types — the kind of work investment banking analysts, management consultants, and corporate lawyers do daily. Drawing from the APEX-Agents benchmark (Vidgen et al., 2026), it teaches Claude to structure tasks as multi-criteria workflows operating across spreadsheets, documents, PDFs, email, calendars, code execution, and file systems, then evaluate outputs against binary rubric criteria. The core insight: professional agent tasks are not single-tool problems — they require navigating a "world" of ~166 interrelated files using 8-10 different application types, with success measured by whether specific factual criteria are met.
Cross-application task decomposition with rubric-based evaluation. APEX-Agents demonstrates that professional knowledge work is fundamentally multi-application: a single task might require reading a PDF contract, cross-referencing data in a spreadsheet, drafting an email summary, and updating a calendar — all within a shared "world" of ~166 files. The benchmark's 480 tasks average 1.82 estimated human-hours each and require satisfying an average of 4.06 binary criteria per task. The critical design choice is that each criterion is independently gradable as Met/Not Met, making evaluation unambiguous. Tasks span three professional domains: investment banking (deal analysis, financial modeling), management consulting (market research, slide decks), and corporate law (contract review, due diligence).
The "world" abstraction. Rather than giving agents isolated inputs, APEX-Agents places them in a realistic work environment — a directory tree of 150-180 files including spreadsheets, documents, PDFs, presentations, emails, calendar entries, and chat logs. The agent must discover which files are relevant, extract the right information, and produce outputs using the correct application. This mirrors real professional work where knowing where to find information is half the challenge. The benchmark includes 33 such worlds (10 banking, 11 consulting, 12 law), each shared across multiple tasks.
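
As a rough illustration of the "world" idea, the sketch below inventories a directory tree so an agent (or a task author) can see what file types and folders exist before deciding what is relevant. The function name `inventory_world` and the returned dictionary layout are assumptions made for this example; they are not part of APEX-Agents or Archipelago.

```python
from collections import Counter
from pathlib import Path


def inventory_world(root: str) -> dict:
    """Summarise a world directory by top-level folder and file extension.

    Hypothetical helper for illustration only.
    """
    root_path = Path(root)
    # Paths relative to the world root, sorted for stable output.
    rel_paths = sorted(
        p.relative_to(root_path) for p in root_path.rglob("*") if p.is_file()
    )
    by_extension = Counter(p.suffix.lower() or "(none)" for p in rel_paths)
    # Files sitting directly in the root are grouped under ".".
    by_folder = Counter(p.parts[0] if len(p.parts) > 1 else "." for p in rel_paths)
    return {
        "total_files": len(rel_paths),
        "by_extension": dict(by_extension),
        "by_folder": dict(by_folder),
        "paths": [p.as_posix() for p in rel_paths],
    }
```

Calling `inventory_world` on a world's root folder gives a quick manifest. A world of the size described above would report a `total_files` value somewhere in the 150-180 range.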
Archipelago execution model. The open-source Archipelago infrastructure provides sandboxed environments where agents can execute code, read/write files, send simulated emails, and interact with tool APIs — all without side effects. Web search is intentionally disabled to ensure reproducibility. This sandboxed approach is directly applicable when building agent evaluation pipelines: isolate the agent, give it a file system and tool access, and grade its outputs against predetermined criteria.
1. Define the world: Inventory all files, data sources, and applications the agent will need. Organize them into a directory structure mirroring a realistic work environment (documents/, spreadsheets/, emails/, presentations/, etc.). Aim for completeness: include both relevant and irrelevant files so the agent must exercise judgment about what matters.
2. Write the task prompt as a single-turn instruction: Phrase the task as a professional would delegate it: "Review the Q3 earnings data in the financials folder, cross-reference against the analyst projections in the strategy deck, and draft a client email summarizing the three largest variances." The prompt should be self-contained but require multi-file navigation.
3. Decompose into binary rubric criteria: Break the expected output into 2-10 independently verifiable criteria. Each criterion must be answerable as Met or Not Met with no ambiguity. Examples: "Email correctly states the largest variance is in COGS at $2.3M" or "Output spreadsheet contains a pivot table grouped by region."
4. Create gold reference outputs: Produce the correct output(s) manually. These serve as the ground truth for evaluation. Include the exact format expected (e.g., .xlsx with specific sheet names, .docx with required sections, plain text email body).
5. Configure the execution sandbox: Set up an isolated environment with file system access, spreadsheet read/write capability, document creation tools, code execution (Python/JS), and simulated email/calendar APIs. Disable external network access for reproducibility.
6. Execute the agent against the task: Provide the agent with the task prompt and access to the world's file system. Let it navigate, read files, execute code, and produce outputs without human intervention. Log all tool calls and file accesses for debugging.
7. Grade each rubric criterion independently: For each criterion, compare the agent's output against the gold reference. Use a judge (human or LLM) that receives the prompt, the agent's output, and relevant artifacts. Score each criterion as 1 (Met) or 0 (Not Met).
8. Compute Pass@1: A task passes only if ALL rubric criteria are met on a single attempt. This strict metric reflects professional standards: a deal memo with the wrong numbers is a failure regardless of how well-formatted it is.
9. Analyze failure modes: Categorize failures by type: wrong file accessed, correct file but wrong data extracted, correct data but wrong computation, correct computation but wrong output format. This diagnosis drives targeted improvements.
10. Iterate on agent capabilities: Based on failure analysis, improve the agent's file navigation heuristics, tool selection logic, or output formatting. Re-run against the same tasks to measure improvement.

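
The grading and Pass@1 steps above can be sketched in a few lines. This is a minimal illustration that assumes one attempt per task and that a judge has already produced a Met/Not Met verdict for each criterion. The `CriterionResult` type, the task IDs, and the criterion texts are hypothetical and are not taken from the benchmark's code or data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CriterionResult:
    """One judged rubric criterion for one task attempt (hypothetical type)."""
    task_id: str
    criterion: str
    met: bool


def pass_at_1(results):
    """Fraction of tasks whose criteria are ALL met on their single attempt."""
    by_task = {}
    for r in results:
        by_task.setdefault(r.task_id, []).append(r.met)
    if not by_task:
        return 0.0
    passed = sum(all(verdicts) for verdicts in by_task.values())
    return passed / len(by_task)


def mean_criterion_score(results):
    """Share of individual criteria met; more lenient than Pass@1."""
    return sum(r.met for r in results) / len(results) if results else 0.0


# Illustrative verdicts, not real benchmark results.
results = [
    CriterionResult("bank-001", "Top 3 segments correct", True),
    CriterionResult("bank-001", "Follows client template", True),
    CriterionResult("law-004", "All High-severity deviations found", True),
    CriterionResult("law-004", "Clause numbers correct", False),
]
print(pass_at_1(results))             # 0.5
print(mean_criterion_score(results))  # 0.75
```

The gap between the two numbers shows why the all-or-nothing metric is strict: three of four criteria are met, yet only one of the two tasks passes.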
Example 1: Investment Banking — Variance Analysis Email
User: "I have a folder with Q3 financial reports, analyst projections, and client
correspondence. I need to identify the top 3 revenue variances between actuals and
projections, then draft a client-ready email summarizing the findings."
Approach:
1. Scan the world directory to find relevant files:
- financials/q3_actuals.xlsx (actual revenue by segment)
- projections/analyst_forecast_q3.xlsx (projected revenue)
- correspondence/client_templates/ (email format reference)
2. Load both spreadsheets, align on segment names, compute variance = actual - projected
for each segment.
3. Rank by absolute variance, select top 3.
4. Draft email using the client template format, citing specific figures.
Rubric Criteria:
- [Met/Not Met] Email identifies correct top 3 segments by variance magnitude
- [Met/Not Met] Each variance figure matches computed difference within rounding
- [Met/Not Met] Email follows the client template format from correspondence/
- [Met/Not Met] Email includes both actual and projected figures for context
Output:
Subject: Q3 Revenue Variance Summary — Top 3 Segments
Dear [Client],
Following our review of Q3 actuals against projections:
1. Enterprise Cloud: $4.2M actual vs. $3.1M projected (+$1.1M / +35.5%)
2. Consumer Hardware: $2.8M actual vs. $3.5M projected (-$0.7M / -20.0%)
3. Professional Services: $1.9M actual vs. $1.5M projected (+$0.4M / +26.7%)
[Remainder of email following template structure...]
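
A minimal sketch of approach steps 2 and 3 from Example 1, using plain dictionaries in place of the two spreadsheets. The first three segments reuse the figures from the sample email; the "Support" segment is a made-up fourth row added only so the ranking has something to exclude. Function names are assumptions, and the sketch assumes every projected value is non-zero.

```python
def top_variances(actuals, projections, n=3):
    """Rank segments present in both inputs by absolute variance (actual - projected)."""
    rows = []
    for segment in actuals.keys() & projections.keys():
        actual, projected = actuals[segment], projections[segment]
        variance = actual - projected
        pct = variance / projected * 100  # assumes projected != 0
        rows.append((segment, actual, projected, variance, pct))
    # Largest absolute variance first; segment name breaks ties deterministically.
    rows.sort(key=lambda r: (-abs(r[3]), r[0]))
    return rows[:n]


def format_line(row):
    """Render one row in the style of the sample email (figures in $M)."""
    segment, actual, projected, variance, pct = row
    sign = "+" if variance >= 0 else "-"
    return (
        f"{segment}: ${actual:.1f}M actual vs. ${projected:.1f}M projected "
        f"({sign}${abs(variance):.1f}M / {pct:+.1f}%)"
    )


ACTUALS = {"Enterprise Cloud": 4.2, "Consumer Hardware": 2.8,
           "Professional Services": 1.9, "Support": 1.0}       # "Support" is hypothetical
PROJECTIONS = {"Enterprise Cloud": 3.1, "Consumer Hardware": 3.5,
               "Professional Services": 1.5, "Support": 1.05}  # "Support" is hypothetical

for i, row in enumerate(top_variances(ACTUALS, PROJECTIONS), start=1):
    print(f"{i}. {format_line(row)}")
```

With real files, the two dictionaries would be loaded from the spreadsheets instead. The second rubric criterion ("within rounding") is the reason the figures are rounded only at formatting time.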
Example 2: Management Consulting — Market Sizing Deck
User: "Using the industry reports and internal data in the project folder, build a
market sizing slide for the European EV charging market. Include TAM, SAM, SOM with
supporting calculations."
Approach:
1. Locate relevant files in the world:
- data/eurostat_ev_registrations.xlsx
- reports/mckinsey_ev_charging_2025.pdf
- internal/client_geographic_footprint.xlsx
- templates/slide_template.pptx
2. Extract European EV registration data and growth rates from the spreadsheet.
3. Pull charging infrastructure market size from the McKinsey PDF.
4. Cross-reference client's geographic presence to compute SAM (countries where
client operates) and SOM (realistic capture rate based on internal data).
5. Build calculations in a supporting spreadsheet, create the summary slide.
Rubric Criteria:
- [Met/Not Met] TAM figure sourced from the McKinsey report with correct citation
- [Met/Not Met] SAM correctly filtered to client's 6 operating countries
- [Met/Not Met] SOM applies the 3-5% capture rate from internal strategy doc
- [Met/Not Met] Supporting calculations provided in a separate .xlsx file
- [Met/Not Met] Slide uses the provided template format
Output:
- market_sizing_slide.pptx (single slide with TAM/SAM/SOM funnel)
- market_sizing_calcs.xlsx (detailed build-up with sourced assumptions)
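
The TAM/SAM/SOM build-up in Example 2 can be sketched as below. This is one possible way to do the filtering, assuming SAM is derived by summing per-country shares of TAM for the client's operating countries. Only the 3-5% capture range comes from the rubric above; the TAM value, country codes, and shares are placeholders, not figures from any report.

```python
def market_funnel(tam, country_share, operating_countries, capture_range=(0.03, 0.05)):
    """Compute SAM and a low/high SOM range from a TAM figure.

    country_share maps country -> fraction of TAM (an assumed input format).
    """
    missing = [c for c in operating_countries if c not in country_share]
    if missing:
        raise KeyError(f"No market share data for: {missing}")
    sam = tam * sum(country_share[c] for c in operating_countries)
    low, high = capture_range
    return {"TAM": tam, "SAM": sam, "SOM_low": sam * low, "SOM_high": sam * high}


# Placeholder inputs for illustration only.
TAM_EUR_M = 1000.0
SHARES = {"DE": 0.15, "FR": 0.10, "NL": 0.05, "SE": 0.04, "NO": 0.03, "BE": 0.03, "IT": 0.08}
CLIENT_COUNTRIES = ["DE", "FR", "NL", "SE", "NO", "BE"]  # six operating countries

funnel = market_funnel(TAM_EUR_M, SHARES, CLIENT_COUNTRIES)
for name, value in funnel.items():
    print(f"{name}: {value:.1f}")
```

Raising on a missing country, rather than silently skipping it, keeps the SAM criterion checkable: a grader can see that all six operating countries were included.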
Example 3: Corporate Law — Contract Red Flag Review
User: "Review the draft SPA in the deals folder against our standard terms checklist.
Flag any clauses that deviate from our templates and summarize the risk exposure."
Approach:
1. Locate files:
- deals/draft_spa_acme_v3.docx (the contract to review)
- templates/standard_spa_terms.docx (firm's standard terms)
- checklists/spa_review_checklist.xlsx (checklist of 42 standard clauses)
2. Parse both documents, align clause-by-clause against the checklist.
3. For each deviation, classify severity (High/Medium/Low) based on
the checklist's risk weighting column.
4. Produce a summary memo listing deviations with page references.
Rubric Criteria:
- [Met/Not Met] All High-severity deviations identified (indemnity cap, governing law, IP assignment)
- [Met/Not Met] Each flagged deviation references the correct clause number in the draft
- [Met/Not Met] Risk exposure summary includes estimated financial impact where applicable
- [Met/Not Met] Output formatted as a memo following firm template in templates/memo_format.docx
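
The clause-by-clause alignment in Example 3 might look like the following, assuming both documents have already been parsed into dictionaries keyed by clause number. Real contract comparison needs far more than whitespace-insensitive text equality; this only illustrates the align, flag, and rank flow. The clause IDs, clause texts, and severity values are invented for the example.

```python
SEVERITY_ORDER = {"High": 0, "Medium": 1, "Low": 2}


def normalise(text):
    """Lower-case and collapse whitespace so cosmetic edits are not flagged."""
    return " ".join(text.lower().split())


def find_deviations(draft, standard, risk_weight):
    """Return clauses that are missing or modified in the draft, highest severity first.

    All three arguments map clause ID -> value (assumed input format).
    """
    deviations = []
    for clause_id, standard_text in standard.items():
        draft_text = draft.get(clause_id)
        if draft_text is None:
            status = "missing"
        elif normalise(draft_text) != normalise(standard_text):
            status = "modified"
        else:
            continue
        deviations.append(
            {"clause": clause_id, "status": status, "severity": risk_weight[clause_id]}
        )
    deviations.sort(key=lambda d: (SEVERITY_ORDER[d["severity"]], d["clause"]))
    return deviations


# Invented clauses for illustration only.
standard = {"8.1": "Indemnity capped at 100% of price.", "12.3": "Governed by English law.",
            "15.2": "Notices in writing."}
draft = {"8.1": "Indemnity capped at 20% of price.", "15.2": "Notices  in writing."}
weights = {"8.1": "High", "12.3": "High", "15.2": "Low"}

for d in find_deviations(draft, standard, weights):
    print(d["severity"], d["clause"], d["status"])
```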
Do:
Avoid:
| Failure Mode | Detection | Resolution |
|---|---|---|
| Agent opens wrong file | File access log shows irrelevant files read | Improve file naming conventions or add a manifest/index file to the world |
| Correct file, wrong data extracted | Output contains stale or adjacent-cell data | Ensure spreadsheets have clear headers; add data validation criteria to rubric |
| Computation error | Rubric criterion fails on specific number | Include the expected computation method in the task prompt or provide a formula reference |
| Output format mismatch | Agent produces .txt instead of .xlsx | Specify exact output format in the task prompt; add format-check criterion to rubric |
| Agent halts mid-task | Execution log shows tool error or timeout | Sandbox must handle tool failures gracefully; set per-task time limits with clear timeout behavior |
| Rubric criterion ambiguous | Two human graders disagree on Met/Not Met | Rewrite criterion to reference a specific, verifiable fact or artifact property |

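
One way to turn the agent-side rows of this table into a tally is to label each failed run with the first pipeline stage that went wrong. The sketch below assumes each run has been annotated with boolean stage flags; those field names are hypothetical, and a real pipeline would derive them from execution logs and rubric results. The "rubric criterion ambiguous" row is left out because it describes the rubric, not the run.

```python
from collections import Counter
from typing import Optional

# Ordered stages: the first flag that is False names the failure mode.
STAGES = [
    ("completed", "agent halted mid-task"),
    ("read_expected_files", "wrong file accessed"),
    ("extracted_expected_values", "wrong data extracted"),
    ("computation_correct", "computation error"),
    ("format_correct", "output format mismatch"),
]


def diagnose(run: dict) -> Optional[str]:
    """Return the earliest failure mode for a run, or None if every stage passed."""
    for field, label in STAGES:
        if not run.get(field, False):
            return label
    return None


def tally_failures(runs) -> Counter:
    """Count failure modes across runs, ignoring runs with no failure."""
    labels = (diagnose(run) for run in runs)
    return Counter(label for label in labels if label is not None)


# Hypothetical annotated runs for illustration.
ok = dict.fromkeys([field for field, _ in STAGES], True)
runs = [
    ok,
    {**ok, "read_expected_files": False},
    {**ok, "read_expected_files": False, "format_correct": False},
    {**ok, "computation_correct": False},
]
print(tally_failures(runs).most_common())
```

Attributing each run to its earliest failed stage avoids double counting: a run that opened the wrong file will usually also fail later stages, but fixing navigation is what would unblock it.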
Paper: Vidgen, B., Mann, A., Fennelly, A., Stanly, J.W., & Rothman, L. (2026). APEX-Agents: AI Productivity Index for Agents. arXiv:2601.14242v2. https://arxiv.org/abs/2601.14242v2
Key takeaway: Look at Section 3 for the task creation methodology (how professionals designed 480 tasks across 33 worlds), Section 4 for the Archipelago execution infrastructure, and the results tables for failure mode analysis showing that cross-application file navigation — not computation — is the primary bottleneck for current agents.
Dataset: mercor/apex-agents on HuggingFace — 480 tasks with prompts, rubrics, gold outputs, and metadata.