skills/assessing-business-process-modeling/SKILL.md
Evaluate and generate BPMN process models from natural language using the BEF4LLM framework. Assess BPMN XML quality across syntactic, pragmatic, semantic, and validity dimensions. Triggers: 'generate BPMN from description', 'evaluate BPMN model quality', 'convert process description to BPMN XML', 'assess business process model', 'validate BPMN diagram', 'text to BPMN'.
npx skillsauth add ndpvt-web/arxiv-claude-skills assessing-business-process-modeling
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
This skill enables Claude to convert natural-language process descriptions into well-formed BPMN 2.0 XML and to evaluate generated BPMN models using the BEF4LLM framework (Lauer et al., 2026). BEF4LLM defines four orthogonal quality perspectives -- syntactic, pragmatic, semantic, and validity -- each with concrete, scoreable metrics. By internalizing these perspectives, Claude can both produce higher-quality BPMN output and provide structured quality assessments of existing models.
BEF4LLM (Business Process Evaluation Framework for LLMs) systematically scores BPMN models across four dimensions rather than relying on ad-hoc or LLM-as-a-judge evaluation. Each dimension captures a different facet of model quality:
Validity (1 metric): Binary XSD schema validation of the BPMN XML. This gate determines whether further evaluation is meaningful -- invalid XML cannot be assessed for syntactic or semantic quality. The paper shows that producing valid BPMN XML is the single hardest challenge for LLMs; many models fail this gate entirely.
Syntactic Quality (16 metrics): Checks adherence to BPMN specification rules -- start/end event existence, correct sequence-flow connectivity, gateway split/join matching, proper pool-lane nesting, message-flow rules, and element labeling. Each rule scores 0 or 1; the dimension score is the average across all 16 checks.
Pragmatic Quality (15 metrics): Measures human understandability via structural complexity indicators grouped into size (node count, diameter), density (connection ratios), connector interplay (gateway heterogeneity, control-flow complexity), partitionability (sequentiality, separability, depth), and concurrency (token splits). Raw values are normalized to a five-point scale (0.0, 0.25, 0.5, 0.75, 1.0) using empirically validated thresholds.
Semantic Quality (7 metrics): Compares generated models against a ground-truth reference using three families -- natural-language similarity (syntactic label matching, semantic embedding similarity, context word overlap), graph-structure similarity (graph-edit distance, common nodes/edges), and behavioral similarity (causal-footprint overlap, dependency-graph overlap). Node alignment uses optimal bipartite matching before scoring.
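The normalization of pragmatic metrics onto the five-point scale can be sketched as follows. This is a minimal illustration, assuming a "lower is better" metric and four ascending cut-offs. The name `five_point_score` and the cut-off values are placeholders introduced here, not the empirically validated thresholds used by BEF4LLM.

```python
from bisect import bisect_left

# The five admissible scores, best first.
SCALE = (1.0, 0.75, 0.5, 0.25, 0.0)


def five_point_score(raw, cutoffs):
    """Map a raw 'lower is better' metric onto the five-point scale.

    cutoffs holds four ascending upper bounds: a raw value up to
    cutoffs[0] scores 1.0, and a value above cutoffs[3] scores 0.0.
    """
    if len(cutoffs) != 4 or list(cutoffs) != sorted(cutoffs):
        raise ValueError("cutoffs must be four ascending numbers")
    return SCALE[bisect_left(cutoffs, raw)]


# Placeholder cut-offs for node count (hypothetical, for illustration only).
NODE_COUNT_CUTOFFS = (10, 20, 30, 50)

if __name__ == "__main__":
    for nodes in (8, 18, 64):
        print(nodes, five_point_score(nodes, NODE_COUNT_CUTOFFS))
```

A "higher is better" metric such as sequentiality could reuse the same function by scoring `1 - value` against cut-offs chosen for that metric.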
The total quality score aggregates as: Q_total = (Q_syn + Q_prag + Q_sem + Q_val) / 4. The generation pipeline uses a system prompt establishing an expert BPMN modeler role, a one-shot example, and a refinement loop: if the initial XML fails validation, a correction prompt listing common mistakes and actual validation errors is sent for one retry.
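The generate-validate-retry loop described above might look like the sketch below. It is not the paper's implementation: `generate` stands for any caller-supplied function that sends a prompt to a model and returns XML text, the prompt wording is invented for illustration, and the default validator only checks well-formedness with the standard library as a stand-in for real XSD validation.

```python
import xml.etree.ElementTree as ET


def well_formedness_errors(xml_text):
    """Stand-in validator: return parse errors, or an empty list if well-formed."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as exc:
        return [str(exc)]
    return []


def generate_with_refinement(description, generate,
                             validate=well_formedness_errors, max_retries=1):
    """Generate BPMN XML, retrying with the validation errors if the check fails.

    generate: hypothetical callable(prompt) -> XML text (e.g. an LLM call).
    validate: callable(xml_text) -> list of error strings.
    Returns the last XML produced and any errors that remain.
    """
    xml_text = generate("Model this process as BPMN 2.0 XML:\n" + description)
    errors = validate(xml_text)
    retries = 0
    while errors and retries < max_retries:
        correction_prompt = (
            "The previous BPMN XML failed validation.\n"
            "Errors:\n- " + "\n- ".join(errors) + "\n"
            "Return corrected XML only.\n\nPrevious XML:\n" + xml_text
        )
        xml_text = generate(correction_prompt)
        errors = validate(xml_text)
        retries += 1
    return xml_text, errors
```

Passing the validator in as a parameter keeps the loop independent of the validation backend, so a schema-aware validator can replace the well-formedness check without touching the retry logic.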
Parse the process description to identify actors (pools/lanes), activities (tasks), decision points (gateways), events (start, end, intermediate), and message exchanges between participants.
Map actors to BPMN swimlanes: Create a <bpmn:collaboration> with one <bpmn:participant> per actor. Each participant references a <bpmn:process>. Add <bpmn:lane> elements within each process if sub-roles exist.
Sequence activities into flow objects: For each identified task, create a <bpmn:task> with a descriptive label. Order them according to the described process flow using <bpmn:sequenceFlow> elements.
Insert gateways for branching logic: Use <bpmn:exclusiveGateway> for XOR decisions ("if/else"), <bpmn:parallelGateway> for AND forks/joins ("simultaneously"), and <bpmn:inclusiveGateway> for OR logic. Ensure every split gateway has a matching join gateway of the same type.
Add start and end events: Every process must have exactly one <bpmn:startEvent> and at least one <bpmn:endEvent>. Add <bpmn:intermediateThrowEvent> or <bpmn:intermediateCatchEvent> for message sends/receives between pools.
Wire message flows: For inter-participant communication, add <bpmn:messageFlow> elements at the collaboration level connecting the sending task/event in one pool to the receiving task/event in another.
Self-validate against the 16 syntactic checks: Verify start/end events exist, all flow objects are connected via sequence flows, no dangling gateways, gateway in/out degrees are correct, tasks have labels, message flows connect different pools rather than elements within the same pool, and pools contain processes.
Omit layout information: Do not generate <bpmndi:BPMNDiagram> elements. Layout should be added algorithmically by a rendering tool (e.g., bpmn-js auto-layout) after generation.
Validate the XML: Check the output parses as well-formed XML and conforms to the BPMN 2.0 XSD schema. If validation fails, identify the specific errors and correct them in one refinement pass.
Output the final BPMN XML with <?xml version="1.0" encoding="UTF-8"?> declaration and proper namespace declarations (xmlns:bpmn, xmlns:bpmndi, xmlns:xsi).
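As a rough illustration of the generation steps above, the following sketch assembles a single-pool, strictly sequential process with Python's standard `xml.etree.ElementTree`. The function name `build_linear_process` and the ID scheme (`P_1`, `Proc_1`, `Task_1`, ...) are assumptions made for this example; gateways, lanes, and message flows are left out to keep it short.

```python
import xml.etree.ElementTree as ET

BPMN_NS = "http://www.omg.org/spec/BPMN/20100524/MODEL"
ET.register_namespace("bpmn", BPMN_NS)  # serialize with the bpmn: prefix


def q(tag):
    """Return the namespace-qualified tag name ElementTree expects."""
    return f"{{{BPMN_NS}}}{tag}"


def build_linear_process(participant_name, task_names):
    """Build BPMN XML for one pool whose tasks run strictly in sequence."""
    root = ET.Element(q("definitions"), {
        "id": "Definitions_1",
        "targetNamespace": "http://example.org/bpmn",
    })
    collab = ET.SubElement(root, q("collaboration"), {"id": "Collab_1"})
    ET.SubElement(collab, q("participant"), {
        "id": "P_1", "name": participant_name, "processRef": "Proc_1",
    })
    process = ET.SubElement(root, q("process"), {
        "id": "Proc_1", "isExecutable": "false",
    })

    # Flow objects: start event, labeled tasks, end event.
    node_ids = ["SE_1"]
    ET.SubElement(process, q("startEvent"), {"id": "SE_1"})
    for index, name in enumerate(task_names, start=1):
        ET.SubElement(process, q("task"), {"id": f"Task_{index}", "name": name})
        node_ids.append(f"Task_{index}")
    ET.SubElement(process, q("endEvent"), {"id": "EE_1"})
    node_ids.append("EE_1")

    # One sequence flow between each pair of consecutive nodes.
    for index, (source, target) in enumerate(zip(node_ids, node_ids[1:]), start=1):
        ET.SubElement(process, q("sequenceFlow"), {
            "id": f"SF_{index}", "sourceRef": source, "targetRef": target,
        })

    # No bpmndi layout is emitted; a renderer is expected to add it later.
    body = ET.tostring(root, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


if __name__ == "__main__":
    print(build_linear_process("Sales Team", ["Receive Order", "Check Inventory"]))
```

Building the tree programmatically rather than concatenating strings guarantees well-formed output, although it does not by itself guarantee XSD validity.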
Validity check: Parse the XML and validate against the BPMN 2.0 XSD. Score: 1.0 if valid, 0.0 if not. If invalid, report specific schema violations and stop (other dimensions are meaningless for invalid XML).
Syntactic quality audit: Check all 16 rules -- start event presence, end event presence, sequence-flow source/target validity, gateway balance, pool-process binding, task labeling, message-flow directionality, event connectivity degrees. Report each as pass/fail and compute the average.
Pragmatic quality assessment: Compute structural metrics (node count, gateway count, sequence-flow count, diameter, density, average connector degree, control-flow complexity, sequentiality, separability, depth, token splits). Normalize each against BEF4LLM thresholds and report the five-point score.
Semantic quality comparison (requires a reference model): Align nodes via bipartite matching on label similarity, then compute label-similarity scores, graph-edit distance, common-nodes/edges ratio, causal-footprint overlap, and dependency-graph overlap. Average all seven metrics.
Aggregate: Report Q_total = (Q_syn + Q_prag + Q_sem + Q_val) / 4 and highlight the weakest dimension with specific remediation advice.
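A syntactic audit along these lines can be partly automated. The sketch below covers only four of the 16 checks (start event, end event, sequence-flow reference validity, dangling nodes); the check names and the helper `syntactic_checks` are chosen for this example and are not the identifiers used by BEF4LLM. Edge cases such as boundary events are ignored.

```python
import xml.etree.ElementTree as ET

BPMN = "{http://www.omg.org/spec/BPMN/20100524/MODEL}"


def local_name(element):
    """Strip the namespace from an element tag."""
    return element.tag.rsplit("}", 1)[-1]


def is_flow_node(element):
    """Treat events, tasks, and gateways as flow nodes (simplified)."""
    name = local_name(element)
    return name == "task" or name.endswith(("Event", "Task", "Gateway"))


def syntactic_checks(xml_text):
    """Return a pass/fail result for a small subset of syntactic rules."""
    root = ET.fromstring(xml_text)
    processes = root.findall(f"{BPMN}process")
    names = ("start_event_exists", "end_event_exists",
             "flow_refs_valid", "no_dangling_nodes")
    # With no process at all, every check fails.
    results = dict.fromkeys(names, bool(processes))
    for process in processes:
        nodes = {e.get("id"): local_name(e) for e in process if is_flow_node(e)}
        flows = process.findall(f"{BPMN}sequenceFlow")
        sources = {f.get("sourceRef") for f in flows}
        targets = {f.get("targetRef") for f in flows}
        kinds = set(nodes.values())
        results["start_event_exists"] &= "startEvent" in kinds
        results["end_event_exists"] &= "endEvent" in kinds
        results["flow_refs_valid"] &= (sources | targets) <= set(nodes)
        for node_id, kind in nodes.items():
            missing_in = kind != "startEvent" and node_id not in targets
            missing_out = kind != "endEvent" and node_id not in sources
            if missing_in or missing_out:
                results["no_dangling_nodes"] = False
    return results


def syntactic_score(results):
    """Average of the binary check results, as the dimension score."""
    return sum(results.values()) / len(results)
```

Each check is binary and the dimension score is their mean, matching the 0-or-1 scoring described for syntactic quality; a full audit would extend `results` with the remaining rules.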
Example 1: Simple Order Fulfillment Process
User: "Convert this to BPMN XML: A customer places an order. The sales team receives the order and checks inventory. If items are in stock, the warehouse ships the order and the customer is notified. If items are out of stock, the sales team notifies the customer of a delay."
Approach:
Output:
<?xml version="1.0" encoding="UTF-8"?>
<bpmn:definitions xmlns:bpmn="http://www.omg.org/spec/BPMN/20100524/MODEL"
                  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
                  id="Definitions_1" targetNamespace="http://example.org/bpmn">
  <!-- One participant (pool) per actor; message flows live at collaboration level -->
  <bpmn:collaboration id="Collab_1">
    <bpmn:participant id="P_Customer" name="Customer" processRef="Proc_Customer"/>
    <bpmn:participant id="P_Sales" name="Sales Team" processRef="Proc_Sales"/>
    <bpmn:participant id="P_Warehouse" name="Warehouse" processRef="Proc_Warehouse"/>
    <bpmn:messageFlow id="MF_1" sourceRef="Task_PlaceOrder" targetRef="Task_ReceiveOrder"/>
    <bpmn:messageFlow id="MF_2" sourceRef="Task_ShipOrder" targetRef="Task_ReceiveShipment"/>
    <bpmn:messageFlow id="MF_3" sourceRef="Task_NotifyDelay" targetRef="Task_ReceiveDelay"/>
  </bpmn:collaboration>
  <!-- Customer: places the order, then waits for whichever notification arrives -->
  <bpmn:process id="Proc_Customer" isExecutable="false">
    <bpmn:startEvent id="SE_Cust" name="Order Needed"/>
    <bpmn:task id="Task_PlaceOrder" name="Place Order"/>
    <bpmn:eventBasedGateway id="GW_CustWait" name="Await Response"/>
    <bpmn:receiveTask id="Task_ReceiveShipment" name="Receive Shipment Notification"/>
    <bpmn:receiveTask id="Task_ReceiveDelay" name="Receive Delay Notification"/>
    <bpmn:endEvent id="EE_Cust"/>
    <bpmn:sequenceFlow id="SF_C1" sourceRef="SE_Cust" targetRef="Task_PlaceOrder"/>
    <bpmn:sequenceFlow id="SF_C2" sourceRef="Task_PlaceOrder" targetRef="GW_CustWait"/>
    <bpmn:sequenceFlow id="SF_C3" sourceRef="GW_CustWait" targetRef="Task_ReceiveShipment"/>
    <bpmn:sequenceFlow id="SF_C4" sourceRef="GW_CustWait" targetRef="Task_ReceiveDelay"/>
    <bpmn:sequenceFlow id="SF_C5" sourceRef="Task_ReceiveShipment" targetRef="EE_Cust"/>
    <bpmn:sequenceFlow id="SF_C6" sourceRef="Task_ReceiveDelay" targetRef="EE_Cust"/>
  </bpmn:process>
  <!-- Sales Team: receives the order and branches on inventory -->
  <bpmn:process id="Proc_Sales" isExecutable="false">
    <bpmn:startEvent id="SE_Sales"/>
    <bpmn:task id="Task_ReceiveOrder" name="Receive Order"/>
    <bpmn:task id="Task_CheckInventory" name="Check Inventory"/>
    <bpmn:exclusiveGateway id="GW_Stock" name="In Stock?"/>
    <bpmn:task id="Task_NotifyDelay" name="Notify Customer of Delay"/>
    <bpmn:endEvent id="EE_Sales1"/>
    <bpmn:endEvent id="EE_Sales2"/>
    <bpmn:sequenceFlow id="SF_S1" sourceRef="SE_Sales" targetRef="Task_ReceiveOrder"/>
    <bpmn:sequenceFlow id="SF_S2" sourceRef="Task_ReceiveOrder" targetRef="Task_CheckInventory"/>
    <bpmn:sequenceFlow id="SF_S3" sourceRef="Task_CheckInventory" targetRef="GW_Stock"/>
    <bpmn:sequenceFlow id="SF_S4" sourceRef="GW_Stock" targetRef="EE_Sales1" name="In Stock"/>
    <bpmn:sequenceFlow id="SF_S5" sourceRef="GW_Stock" targetRef="Task_NotifyDelay" name="Out of Stock"/>
    <bpmn:sequenceFlow id="SF_S6" sourceRef="Task_NotifyDelay" targetRef="EE_Sales2"/>
  </bpmn:process>
  <!-- Warehouse: ships the order -->
  <bpmn:process id="Proc_Warehouse" isExecutable="false">
    <bpmn:startEvent id="SE_WH"/>
    <bpmn:task id="Task_ShipOrder" name="Ship Order"/>
    <bpmn:endEvent id="EE_WH"/>
    <bpmn:sequenceFlow id="SF_W1" sourceRef="SE_WH" targetRef="Task_ShipOrder"/>
    <bpmn:sequenceFlow id="SF_W2" sourceRef="Task_ShipOrder" targetRef="EE_WH"/>
  </bpmn:process>
</bpmn:definitions>
Example 2: Evaluating an Existing BPMN Model
User: "Evaluate the quality of this BPMN XML file against the BEF4LLM framework."
Approach:
Output:
BEF4LLM Quality Assessment
===========================
Validity (Q_val): 1.0 -- XML is schema-valid.
Syntactic Quality (Q_syn): 0.875 (14/16 checks passed)
[PASS] Start event exists in each process
[PASS] End event exists in each process
[FAIL] Gateway GW_3 has mismatched split/join -- parallel split has no corresponding join
[PASS] All sequence flows have valid source and target refs
[FAIL] Task_7 has no outgoing sequence flow (dangling node)
... (11 more checks passed)
Pragmatic Quality (Q_prag): 0.70
- Node count: 18 (threshold: good)
- Control-flow complexity: 4.2 (threshold: moderate)
- Sequentiality: 0.65 (threshold: good)
- Depth: 3 (threshold: acceptable)
Semantic Quality (Q_sem): N/A (no reference model provided)
Total Score (Q_sem excluded): (0.875 + 0.70 + 1.0) / 3 = 0.858
Recommendations:
1. Add a parallel join gateway to match the split at GW_3
2. Connect Task_7 to a downstream element or end event
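The total in this assessment averages only the three dimensions that could be scored. A small sketch of that aggregation is shown below; treating a missing dimension as "skip and average the rest" is an assumption drawn from this example, since the Q_total formula itself divides by four. The function name `total_quality` is hypothetical.

```python
def total_quality(q_val, q_syn=None, q_prag=None, q_sem=None):
    """Average the dimension scores that are available (None means N/A)."""
    available = [score for score in (q_syn, q_prag, q_sem, q_val)
                 if score is not None]
    return sum(available) / len(available)


if __name__ == "__main__":
    # Scores from the assessment above; no reference model, so q_sem is omitted.
    print(round(total_quality(q_val=1.0, q_syn=0.875, q_prag=0.70), 3))
```

With all four scores supplied, the function reduces to (Q_syn + Q_prag + Q_sem + Q_val) / 4.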
Example 3: Fixing a Failed BPMN Model
User: "My BPMN XML fails validation. Can you fix it and tell me what was wrong?"
Approach:
targetNamespace, sequence flows referencing non-existent IDs, gateways without flows
Output:
Validation Errors Found:
1. Missing targetNamespace attribute on <bpmn:definitions>
2. sequenceFlow "SF_5" references targetRef="Task_X" which does not exist
3. <bpmn:exclusiveGateway> "GW_1" has only one outgoing flow (needs >= 2)
Fixes Applied:
1. Added targetNamespace="http://example.org/bpmn"
2. Changed SF_5 targetRef to "Task_Review" (matching the described process)
3. Added second outgoing flow from GW_1 to the rejection path
Post-fix Syntactic Quality: 1.0 (16/16 checks passed)
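The three kinds of error in this example can be detected mechanically before any fix is attempted. The sketch below is one possible diagnostic pass using only the standard library; the function name `diagnose` and the message wording are assumptions for illustration, and it is not a substitute for XSD validation. An exclusive gateway is flagged only when it neither splits nor joins, because a join legitimately has a single outgoing flow.

```python
import xml.etree.ElementTree as ET

BPMN = "{http://www.omg.org/spec/BPMN/20100524/MODEL}"


def diagnose(xml_text):
    """Return human-readable descriptions of a few common BPMN mistakes."""
    root = ET.fromstring(xml_text)
    problems = []

    if not root.get("targetNamespace"):
        problems.append("Missing targetNamespace attribute on definitions")

    for process in root.iter(f"{BPMN}process"):
        flows = process.findall(f"{BPMN}sequenceFlow")
        # IDs of everything in the process that a flow may point at.
        node_ids = {child.get("id") for child in process
                    if child.tag != f"{BPMN}sequenceFlow"}

        for flow in flows:
            for attr in ("sourceRef", "targetRef"):
                ref = flow.get(attr)
                if ref not in node_ids:
                    problems.append(
                        f'sequenceFlow "{flow.get("id")}" references '
                        f'{attr}="{ref}" which does not exist'
                    )

        for gateway in process.findall(f"{BPMN}exclusiveGateway"):
            gateway_id = gateway.get("id")
            outgoing = sum(1 for f in flows if f.get("sourceRef") == gateway_id)
            incoming = sum(1 for f in flows if f.get("targetRef") == gateway_id)
            if outgoing < 2 and incoming < 2:
                problems.append(
                    f'exclusiveGateway "{gateway_id}" neither splits nor joins '
                    f"({incoming} incoming, {outgoing} outgoing)"
                )
    return problems
```

Running this before and after a repair gives a quick confirmation that the listed problems are gone, independent of the full syntactic audit.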
Omit <bpmndi:BPMNDiagram> layout elements from generated XML. Layout should be computed algorithmically by a renderer, not hallucinated by the model.

| Problem | Cause | Resolution |
|---------|-------|------------|
| XML fails schema validation | Missing namespaces, malformed element names, invalid attribute values | Parse the XSD error message, fix the specific violation, re-validate. Allow one refinement pass. |
| Gateway mismatch | Split gateway has no corresponding join, or types differ | Trace all paths from the split and add a join gateway where paths reconverge. Ensure types match. |
| Dangling flow objects | Tasks or events with no incoming or outgoing sequence flows | Connect orphaned elements to the correct position in the process flow, or remove if they are artifacts of generation errors. |
| Missing message flows | Inter-participant communication described in text but absent in model | Add <bpmn:messageFlow> at the collaboration level connecting the appropriate send/receive elements across pools. |
| Semantic drift from description | Model includes tasks not in the description or omits described tasks | Re-read the source text, list all activities explicitly mentioned, and diff against the model's task labels. Use bipartite matching to identify gaps. |
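
For the semantic-drift row, the diff between described activities and modeled task labels can be sketched as an optimal one-to-one alignment. The example below is a simplification under stated assumptions: it uses `difflib.SequenceMatcher` for label similarity instead of embeddings, finds the best assignment by brute force over permutations (fine for a handful of labels; a Hungarian-algorithm solver would be the usual choice for larger models), and the 0.6 similarity cut-off is an arbitrary illustrative value.

```python
from difflib import SequenceMatcher
from itertools import permutations


def label_similarity(a, b):
    """Case-insensitive string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def align_labels(expected, modeled, min_similarity=0.6):
    """Align described activities with modeled task labels.

    Returns (matched pairs, activities missing from the model,
    model tasks not found in the description).
    """
    size = max(len(expected), len(modeled))
    # Pad the shorter side so every label gets a (possibly empty) partner.
    left = list(expected) + [None] * (size - len(expected))
    right = list(modeled) + [None] * (size - len(modeled))

    def sim(a, b):
        return 0.0 if a is None or b is None else label_similarity(a, b)

    # Brute-force optimal bipartite matching: maximize total similarity.
    best = max(
        permutations(range(size)),
        key=lambda perm: sum(sim(left[i], right[j]) for i, j in enumerate(perm)),
    )

    matched, missing, extra = [], [], []
    for i, j in enumerate(best):
        a, b = left[i], right[j]
        if a is not None and b is not None and sim(a, b) >= min_similarity:
            matched.append((a, b))
        else:
            if a is not None:
                missing.append(a)
            if b is not None:
                extra.append(b)
    return matched, missing, extra


if __name__ == "__main__":
    described = ["Check inventory", "Ship order", "Notify customer of delay"]
    in_model = ["Ship Order", "Check Inventory", "Archive Invoice"]
    print(align_labels(described, in_model))
```

Labels left in `missing` point to described activities the model omits, and labels in `extra` point to tasks the model introduced on its own.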
Lauer, C., Pfeiffer, P., Rombach, A., & Mehdiyev, N. (2026). Assessing the Business Process Modeling Competences of Large Language Models. arXiv:2601.21787v1. https://arxiv.org/abs/2601.21787v1
Key takeaway: Look at Section 4 for the full metric definitions across all four BEF4LLM perspectives, and Section 5 for benchmark results showing that LLMs are competitive with human modelers on syntactic/pragmatic quality but lag on semantic accuracy and validity.