skills/entworld-holistic-environment-benchmark/SKILL.md
Build verifiable enterprise GUI agent benchmarks using schema-grounded task generation and SQL-based deterministic verification. Use when: 'generate enterprise test tasks from a database schema', 'build SQL verification for GUI agent tasks', 'create benchmark for CRM/ERP/ITIL agents', 'reverse-engineer business logic from DB schema', 'validate agent actions with database state checks', 'set up dockerized enterprise benchmark environments'.
npx skillsauth add ndpvt-web/arxiv-claude-skills entworld-holistic-environment-benchmark
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
This skill enables Claude to build enterprise-grade GUI agent benchmarks and verification systems using the EntWorld methodology. The core technique reverse-engineers business logic directly from database schemas to synthesize realistic multi-step tasks, then validates agent completion through SQL-based state-transition checks rather than brittle visual matching. Apply this when building evaluation harnesses for enterprise automation agents, generating test scenarios from existing databases, or implementing deterministic verification of GUI-driven workflows across CRM, ERP, ITIL, asset management, and project management domains.
Schema-Grounded Task Generation replaces manual task authoring with a four-stage pipeline that starts from the database itself. First, Schema Discovery queries the database catalog, filters non-empty tables, and uses an LLM to produce a "SchemaJSON" that infers each table's business purpose from column names, types, and sample rows. Second, Relationship Inference reconstructs entity-relationship graphs by parsing explicit foreign keys and using LLM-based inference for implicit links (validated by probe SQL queries that confirm join results are non-empty). Third, Task Template Synthesis generates parameterized task templates containing natural-language prompts plus SQL logic with placeholders, constrained by the verified schema graph. Fourth, Data Instantiation executes cross-table queries to populate templates with real database records, producing concrete tasks with ground-truth answers.
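The final stage, Data Instantiation, can be sketched as follows. This is a minimal illustration rather than the EntWorld implementation: the `contact` table, its `email` column, the template text, and the helper names `instantiate` and `demo_db` are all assumptions made for the example, and SQLite stands in for the application's real database.

```python
import sqlite3

# Hypothetical task template: {name} is the placeholder filled from real rows.
TEMPLATE = {
    "instruction": "Find the email address of the contact named {name}.",
    "sql": "SELECT email FROM contact WHERE name = ? AND deleted = 0",
}


def instantiate(conn, template, limit=3):
    """Fill the template with real records and attach ground-truth answers."""
    rows = conn.execute(
        "SELECT name FROM contact WHERE deleted = 0 ORDER BY id LIMIT ?",
        (limit,),
    ).fetchall()
    tasks = []
    for (name,) in rows:
        answer = conn.execute(template["sql"], (name,)).fetchall()
        if not answer:
            # Skip instantiations that have no ground truth to check against.
            continue
        tasks.append({
            "instruction": template["instruction"].format(name=name),
            "sql": template["sql"],
            "params": [name],
            "expected": [row[0] for row in answer],
        })
    return tasks


def demo_db():
    """In-memory stand-in for an application database (illustrative data)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE contact (
            id INTEGER PRIMARY KEY, name TEXT, email TEXT,
            deleted INTEGER DEFAULT 0
        );
        INSERT INTO contact (name, email, deleted) VALUES
            ('Maria Chen', 'maria@example.com', 0),
            ('Old Record', 'old@example.com', 1);
    """)
    return conn
```

Calling `instantiate(demo_db(), TEMPLATE)` yields one concrete task per live record; the soft-deleted row never becomes a task because the sampling query filters on `deleted = 0`.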
SQL-Based Deterministic Verification replaces visual matching and LLM-as-a-judge approaches with direct database interrogation. After an agent completes a GUI task, the system executes a pre-defined SQL query against the application's database to check whether the expected state transition occurred. For read operations (SELECT), it compares the agent's extracted answer against the query result. For create/update/delete operations, it validates affected rows within a transaction block, then issues a rollback to preserve the environment for subsequent tasks. This makes verification binary and reproducible: either the database reflects the correct state or it does not.
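A rough sketch of both verification modes, using SQLite as a stand-in database. The `ticket` table, the `TKT-0001` reference, and the function names are hypothetical, and a direct `UPDATE` stands in for whatever the agent does through the GUI; how a real agent's writes reach the database is outside this sketch.

```python
import sqlite3


def make_db():
    """In-memory stand-in database with one illustrative ticket."""
    # isolation_level=None puts the connection in autocommit mode, so the
    # BEGIN/ROLLBACK statements below are issued explicitly by this code.
    conn = sqlite3.connect(":memory:", isolation_level=None)
    conn.executescript("""
        CREATE TABLE ticket (id INTEGER PRIMARY KEY, ref TEXT, priority INTEGER);
        INSERT INTO ticket (ref, priority) VALUES ('TKT-0001', 2);
    """)
    return conn


def verify_read(conn, sql, params, agent_answer):
    """Read task: pass if the agent's answer equals the first result column."""
    row = conn.execute(sql, params).fetchone()
    return row is not None and str(row[0]) == str(agent_answer).strip()


def verify_write_with_rollback(conn, simulated_action, check_sql, params, expected):
    """Write task: apply the change, check the state, then roll back."""
    conn.execute("BEGIN")
    try:
        simulated_action(conn)
        row = conn.execute(check_sql, params).fetchone()
        return row is not None and row[0] == expected
    finally:
        # Restore the pre-task state for subsequent tasks.
        conn.execute("ROLLBACK")
```

Either function returns a plain boolean, which keeps the result binary: the database either reflects the expected state or it does not.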
Difficulty Quantification scores each task as D_task = sum(w_i * d_i) across five dimensions: table complexity (number of tables involved), relational complexity (join depth), operation complexity (action count), result complexity (output cardinality), and SQL complexity (subqueries, aggregates, conditions). This enables stratified evaluation so you can identify where agents fail (e.g., long-horizon tasks >15 steps where GPT-4.1 drops to 2.7% success).
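The weighted sum is simple to compute once each dimension has a numeric value. In the sketch below, the weights, the easy/medium/hard cutoffs, and the feature values are illustrative assumptions; the source does not list the values used in the paper.

```python
# Hypothetical weights, one per dimension (not taken from the paper).
WEIGHTS = {"tables": 1.0, "joins": 1.0, "actions": 0.5, "result": 0.5, "sql": 1.0}


def difficulty(features, weights=WEIGHTS):
    """Weighted sum D_task = sum(w_i * d_i) over the five dimensions."""
    missing = set(weights) - set(features)
    if missing:
        raise ValueError(f"missing dimensions: {sorted(missing)}")
    return sum(weights[name] * features[name] for name in weights)


def stratum(score, cutoffs=(5.0, 10.0)):
    """Bucket a score for stratified analysis (cutoffs are illustrative)."""
    low, high = cutoffs
    if score < low:
        return "easy"
    if score < high:
        return "medium"
    return "hard"


# Example task: 3 tables, join depth 2, 8 GUI actions, 1 result row,
# SQL complexity 2 (for instance one aggregate plus one extra condition).
task = {"tables": 3, "joins": 2, "actions": 8, "result": 1, "sql": 2}
score = difficulty(task)   # 3.0 + 2.0 + 4.0 + 0.5 + 2.0 = 11.5
label = stratum(score)     # "hard" under the illustrative cutoffs
```

In practice the raw dimension values would likely need normalizing before weighting so that a single dimension such as action count does not dominate the score.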
Inventory the database schema. Connect to the target application's database, enumerate all non-empty tables, and extract column names, types, foreign keys, and row counts. Store this as a structured catalog (JSON or YAML).
Generate SchemaJSON for each table. For every table, prompt an LLM with the table name, column definitions, and 3-5 sample rows. Ask it to infer the business purpose, identify which columns are user-facing fields vs. internal IDs, and classify the table's domain role (e.g., "contact record", "service ticket", "asset entry").
Reconstruct entity-relationship graphs. Parse explicit foreign key constraints directly. For tables lacking declared FKs, use LLM inference on column name/type similarity (e.g., customer_id in an orders table likely references id in a customers table). Validate every inferred relationship by running a probe query: SELECT 1 FROM A JOIN B ON A.col = B.col LIMIT 1. Discard relationships that return zero rows.
Synthesize task templates. Prompt an LLM with the verified schema graph and ask it to generate task templates. Each template must include: (a) a natural-language instruction describing the GUI workflow, (b) the SQL query that encodes the expected outcome, and (c) placeholder tokens for entity-specific values. Constrain templates to paths that exist in the schema graph.
Instantiate tasks with real data. Execute cross-table SELECT queries to fill placeholders with authentic database records. For state-change tasks (create/update/delete), generate realistic data variants using an LLM (e.g., plausible contact names, ticket descriptions). Validate each instantiated SQL query by running it in a transaction block and confirming non-empty results before rolling back.
Build verification queries. For each task, finalize the SQL verification query: SELECT-based for information retrieval tasks (compare agent output to query result), or row-count/field-match checks for CUD operations (run pre-task snapshot, execute agent, run post-task query, diff). Wrap CUD verifications in transactions with rollback.
Assign difficulty scores. Compute D_task for each task by measuring: number of tables referenced, join depth, total GUI actions required, result set size, and SQL complexity (presence of subqueries, aggregates, GROUP BY, HAVING). Weight each dimension and produce a single numeric score for stratified analysis.
Package the environment in Docker. Create a self-contained Docker image per enterprise application that bundles the web app, database (pre-populated with benchmark data), and all dependencies. Expose the web UI on a mapped port. Include a require_reset flag on CUD tasks so the environment can be restored between runs.
Run agent evaluation. Deploy the Docker environment, present each task's natural-language instruction to the agent (along with a screenshot + accessibility tree of the current page), let the agent interact with the GUI, then execute the verification SQL against the database to produce a binary pass/fail result.
Analyze results by difficulty stratum. Group tasks by difficulty score, domain, and action count. Identify failure patterns: short tasks (<5 steps) vs. long tasks (>15 steps), single-table lookups vs. multi-join queries, and domain-specific challenges (e.g., ITIL systems with dense configuration management interfaces).
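The schema inventory and relationship-probe steps above can be sketched against SQLite using its catalog table and PRAGMAs. Other engines expose the same information differently (for example through `information_schema`), so treat the catalog queries here as SQLite-specific. The table names and the helpers `inventory` and `join_is_populated` are hypothetical.

```python
import sqlite3


def inventory(conn):
    """Catalog non-empty tables: columns, declared foreign keys, row counts."""
    catalog = {}
    names = [r[0] for r in conn.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'table' AND name NOT LIKE 'sqlite_%'"
    )]
    for table in names:
        count = conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()[0]
        if count == 0:
            continue  # empty tables cannot yield instantiated tasks
        # table_info rows: (cid, name, type, notnull, dflt_value, pk)
        cols = [(r[1], r[2]) for r in conn.execute(f'PRAGMA table_info("{table}")')]
        # foreign_key_list rows: (id, seq, table, from, to, ...)
        fks = [(r[3], r[2], r[4])
               for r in conn.execute(f'PRAGMA foreign_key_list("{table}")')]
        catalog[table] = {"columns": cols, "foreign_keys": fks, "rows": count}
    return catalog


def join_is_populated(conn, left, left_col, right, right_col):
    """Probe query: does the candidate join return at least one row?"""
    # Identifiers come from the catalog above, never from untrusted input.
    sql = (f'SELECT 1 FROM "{left}" a JOIN "{right}" b '
           f'ON a."{left_col}" = b."{right_col}" LIMIT 1')
    return conn.execute(sql).fetchone() is not None


def demo_db():
    """Illustrative schema: one implicit link, one declared foreign key."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL);
        CREATE TABLE invoices (id INTEGER PRIMARY KEY,
                               order_id INTEGER REFERENCES orders(id));
        CREATE TABLE audit_log (id INTEGER PRIMARY KEY, note TEXT);
        INSERT INTO customers (name) VALUES ('Acme'), ('Globex');
        INSERT INTO orders (customer_id, total) VALUES (1, 99.5);
        INSERT INTO invoices (order_id) VALUES (1);
    """)
    return conn
```

Here `orders.customer_id` has no declared foreign key, so it would be proposed by LLM inference and kept only because `join_is_populated(conn, "orders", "customer_id", "customers", "id")` finds a matching row; a spurious candidate such as `orders.total` joined to `customers.id` returns nothing and is discarded.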
Example 1: Generating a CRM benchmark task from schema
User: "I have an EspoCRM database. Generate a benchmark task that tests whether an agent can find a specific contact's related opportunities."
Approach:
Identify the contact and opportunity tables and their join table contact_opportunity.
In the inferred schema, contact holds person records and opportunity holds sales deals, linked via contact_opportunity.contact_id and contact_opportunity.opportunity_id.
Run SELECT name FROM contact WHERE deleted=0 LIMIT 1 to get a real contact name, e.g., "Maria Chen".
Generate the verification query:
SELECT SUM(o.amount) AS total_revenue
FROM opportunity o
JOIN contact_opportunity co ON o.id = co.opportunity_id
JOIN contact c ON c.id = co.contact_id
WHERE c.name = 'Maria Chen'
AND o.stage NOT IN ('Closed Lost', 'Closed Won')
AND o.deleted = 0;
Output: Task with instruction "Find all open opportunities for Maria Chen and report the total expected revenue", verified by comparing the agent's answer against the SQL result (e.g., $142,500).
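Comparing a free-text answer such as "$142,500" with a numeric SQL result needs some normalization first. The sketch below is one possible approach and is not described in the source: the function names, the character-stripping rule, and the tolerance are assumptions.

```python
import re
from decimal import Decimal, InvalidOperation


def parse_amount(text):
    """Strip currency symbols, separators, and words; return a Decimal or None."""
    cleaned = re.sub(r"[^\d.\-]", "", text)
    try:
        return Decimal(cleaned)
    except InvalidOperation:
        return None


def amounts_match(agent_answer, sql_value, tolerance=Decimal("0.01")):
    """Pass if the agent's stated amount equals the SQL result within tolerance."""
    parsed = parse_amount(agent_answer)
    if parsed is None:
        return False
    return abs(parsed - Decimal(str(sql_value))) <= tolerance
```

With this, `amounts_match("$142,500", 142500.0)` passes, while an answer with no parseable number fails outright instead of raising. Note that abbreviations such as "140k" are not expanded; the sketch reads that as 140.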
Example 2: SQL-based verification for a state-change task
User: "Create a verification check for a task where the agent must update a service ticket's priority in iTOP."
Approach:
-- Pre-task snapshot (run before the agent acts)
SELECT id, priority FROM ticket WHERE ref = 'TKT-4821';
-- Expected: priority = 'Medium' (encoded as 2)

-- Post-task verification (run after the agent finishes)
SELECT CASE
         WHEN priority = 4 THEN 'PASS'
         ELSE 'FAIL'
       END AS result
FROM ticket
WHERE ref = 'TKT-4821';

-- Dry run of the check: apply the expected change inside a transaction,
-- confirm the new value is visible, then roll back so the data is untouched
BEGIN;
UPDATE ticket SET priority = 4 WHERE ref = 'TKT-4821';
SELECT priority FROM ticket WHERE ref = 'TKT-4821';  -- expected: 4
ROLLBACK;
Output: Binary PASS/FAIL based on database state, no screenshot comparison needed.
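The pre-task snapshot and post-task check in Example 2 can also be combined into a diff that fails the task when the agent changes anything beyond the intended field. This is a sketch under assumptions: SQLite stands in for the iTOP database, and the helper names and sample rows are hypothetical.

```python
import sqlite3


def snapshot(conn, table, key="id"):
    """Capture every row of a table as {key value: {column: value}}."""
    cur = conn.execute(f'SELECT * FROM "{table}"')
    cols = [d[0] for d in cur.description]
    return {row[cols.index(key)]: dict(zip(cols, row)) for row in cur}


def diff(before, after):
    """Return {key: (old_row, new_row)} for rows added, removed, or changed."""
    changes = {}
    for k in before.keys() | after.keys():
        old, new = before.get(k), after.get(k)
        if old != new:
            changes[k] = (old, new)
    return changes


def only_expected_change(before, after, key, field, expected):
    """Pass only if exactly one row changed, in exactly one field, to `expected`."""
    changes = diff(before, after)
    if set(changes) != {key}:
        return False
    old, new = changes[key]
    if old is None or new is None:
        return False  # the row was created or deleted rather than updated
    changed = {c for c in new if old[c] != new[c]}
    return changed == {field} and new[field] == expected


def demo_db():
    """Two illustrative tickets in an in-memory database."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE ticket (id INTEGER PRIMARY KEY, ref TEXT, priority INTEGER);
        INSERT INTO ticket (ref, priority) VALUES ('TKT-A', 2), ('TKT-B', 3);
    """)
    return conn
```

A single-row check like the CASE expression above cannot see side effects on other tickets; the diff can, at the cost of reading the whole table twice.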
Example 3: Dockerized benchmark environment setup
User: "Set up a reproducible evaluation environment for testing GUI agents on OpenProject."
Approach:
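# Base image: the application under test
FROM openproject/openproject:14
# Seed the database with benchmark data and ship the task manifest
COPY benchmark_seed.sql /docker-entrypoint-initdb.d/
COPY tasks.json /benchmark/tasks.json
# Web UI on 8080, PostgreSQL on 5432 (as assumed by this example; confirm against the base image)
EXPOSE 8080 5432
ENV OPENPROJECT_SECRET_KEY_BASE=benchmark_secret
ENV DATABASE_URL=postgres://openproject:openproject@localhost/openproject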
docker build -t entworld-openproject . && docker run -p 8080:8080 -p 5432:5432 entworld-openproject
The agent connects to http://localhost:8080 and receives task instructions from tasks.json; verification queries run against localhost:5432.
Output: Self-contained Docker image that any team can pull and run to reproduce identical evaluation conditions.
Set require_reset: true in your task manifest for CUD tasks so the environment is restored between runs.
Include WHERE deleted = 0 (or equivalent) in both task instantiation and verification queries.
Add LIMIT 1 to all relationship validation queries. You only need to confirm the join path exists, not enumerate results.
Schemas with uninformative names (e.g., tbl_x1, col_a) will produce low-quality tasks.
EntWorld: A Holistic Environment and Benchmark for Verifiable Enterprise GUI Agents, Mo et al., 2026. Focus on Section 3 (schema-grounded task generation pipeline), Section 4 (SQL-based deterministic verification), and Section 5.3 (difficulty quantification formula) for implementation details.