integrations/claude-code-skills/arthur-onboard/arthur-onboard-evals/SKILL.md
---
name: arthur-onboard-evals
description: Arthur onboarding sub-skill — Step 9: Recommend and configure continuous LLM evals for the Arthur task. Reads credentials and eval provider from .arthur-engine.env.
allowed-tools: Bash, Read
---

# Arthur Onboard — Step 9: Recommend & Configure Continuous Evals
## Read State

```bash
cat .arthur-engine.env 2>/dev/null || echo "(no state file)"
```

Parse `ARTHUR_ENGINE_URL`, `ARTHUR_API_KEY`, `ARTHUR_TASK_ID`, `ARTHUR_EVAL_PROVIDER`, `ARTHUR_EVAL_MODEL`.

**Skip** if `ARTHUR_EVAL_PROVIDER` is empty or `none`: no eval model was configured in Step 8, so exit this skill.
```bash
curl -s \
  -H "Authorization: Bearer $ARTHUR_API_KEY" \
  "$ARTHUR_ENGINE_URL/api/v1/tasks/$ARTHUR_TASK_ID/continuous_evals?page=0&page_size=100" | \
  python3 -c "
import sys, json
d = json.load(sys.stdin)
evals = d.get('evals', [])
for e in evals:
    print(f'  • {e[\"name\"]}')
print(f'COUNT={len(evals)}')
"
```
If evals already exist, show them and ask if the user wants additional recommendations.
```bash
TRACE_RESPONSE=$(curl -s \
  -H "Authorization: Bearer $ARTHUR_API_KEY" \
  "$ARTHUR_ENGINE_URL/api/v1/traces?task_ids=$ARTHUR_TASK_ID&page_size=5")

TRACE_ID=$(echo "$TRACE_RESPONSE" | \
  python3 -c "
import sys, json
d = json.load(sys.stdin)
traces = d.get('traces') or d.get('data') or []
print(traces[0]['trace_id'] if traces else '')
" 2>/dev/null)

if [ -n "$TRACE_ID" ]; then
  curl -s \
    -H "Authorization: Bearer $ARTHUR_API_KEY" \
    "$ARTHUR_ENGINE_URL/api/v1/traces/$TRACE_ID"
fi
```
Using the trace input/output content, recommend 2-4 continuous evals. Choose based on what the trace reveals about the application:
| Application type | Good eval choices |
|-----------------|-------------------|
| Customer support | Relevance, tone, task completion |
| RAG / Q&A | Faithfulness (only if retrieval spans found), relevance |
| Code assistant | Correctness, clarity |
| General chat | Relevance, toxicity |
| Agentic workflows | Task completion, relevance |
**Important:** Only recommend a faithfulness/hallucination eval if the trace contains retrieval spans (`span_kind=RETRIEVER`, or a span name matching `retrieve`, `search`, `document`, or `rag`).
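The table and the retrieval rule can be combined into one small selection helper. The following is a minimal sketch, not part of the skill itself: it assumes each span is a dict with `span_kind` and `span_name` keys (the real trace payload shape is not shown in this document, so those key names are assumptions), and `EVALS_BY_APP_TYPE`, `has_retrieval_spans`, and `recommend_evals` are hypothetical names.

```python
import re

# Mirrors the table above. Keys and helper names are illustrative only,
# not part of the Arthur API.
EVALS_BY_APP_TYPE = {
    "customer_support": ["relevance", "tone", "task_completion"],
    "rag_qa": ["faithfulness", "relevance"],
    "code_assistant": ["correctness", "clarity"],
    "general_chat": ["relevance", "toxicity"],
    "agentic_workflow": ["task_completion", "relevance"],
}

# Loose substring match on the span name, as the rule above describes.
RETRIEVAL_NAME = re.compile(r"retrieve|search|document|rag", re.IGNORECASE)


def has_retrieval_spans(spans):
    """Return True if any span looks like a retrieval step.

    Assumes each span is a dict with optional 'span_kind' and 'span_name' keys.
    """
    for span in spans:
        if span.get("span_kind") == "RETRIEVER":
            return True
        if RETRIEVAL_NAME.search(span.get("span_name") or ""):
            return True
    return False


def recommend_evals(app_type, spans):
    """Pick evals for an app type, dropping faithfulness without retrieval spans."""
    choices = list(EVALS_BY_APP_TYPE.get(app_type, []))
    if not has_retrieval_spans(spans):
        choices = [c for c in choices if c != "faithfulness"]
    return choices
```

Because the name check is a plain substring match, a span named `storage_write` would also count as retrieval (it contains `rag`); tightening the pattern may be worthwhile if false positives appear.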
For each recommendation, write specific Jinja2-format instructions:

```
Evaluate the following LLM interaction for <quality dimension>:

Input: {{ input }}
Output: {{ output }}

<scoring criteria specific to the dimension>

Return a JSON object: {"score": <0.0-1.0 where 1.0 = fully passes>, "reason": "<brief explanation>"}
```
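To keep every eval's instructions in the same shape, the template above could be assembled by a small helper. This is a sketch under the assumption that the instructions are passed as a plain string with the `{{ input }}` and `{{ output }}` placeholders left unrendered; `build_instructions` is a hypothetical name, not an Arthur function.

```python
def build_instructions(dimension, criteria):
    """Assemble Jinja2-format eval instructions for one quality dimension.

    The {{ input }} and {{ output }} placeholders are kept literal here;
    only the dimension name and scoring criteria are filled in.
    """
    return "\n".join([
        f"Evaluate the following LLM interaction for {dimension}:",
        "",
        "Input: {{ input }}",
        "Output: {{ output }}",
        "",
        criteria.strip(),
        "",
        'Return a JSON object: {"score": <0.0-1.0 where 1.0 = fully passes>, '
        '"reason": "<brief explanation>"}',
    ])
```

Only the first line is an f-string, so the double braces in the placeholder lines are not collapsed by Python's own formatting.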
Show the recommendations with rationale and ask for user approval.
For each recommended eval, create the LLM eval rule:
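```bash
# Build the body with json.dumps so quotes and newlines in $INSTRUCTIONS are escaped.
PAYLOAD=$(M="$EVAL_MODEL" P="$EVAL_PROVIDER" I="$INSTRUCTIONS" python3 -c '
import json, os
print(json.dumps({"model_name": os.environ["M"], "model_provider": os.environ["P"], "instructions": os.environ["I"]}))')
curl -s -X POST \
  -H "Authorization: Bearer $ARTHUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d "$PAYLOAD" \
  "$ARTHUR_ENGINE_URL/api/v1/tasks/$ARTHUR_TASK_ID/llm_evals/$SLUG"
```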
After all LLM evals are created, create one shared transform:
```bash
TRANSFORM_RESPONSE=$(curl -s -X POST \
  -H "Authorization: Bearer $ARTHUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d "{
    \"name\": \"Arthur — Input/Output Extractor\",
    \"definition\": {
      \"variables\": [
        {\"variable_name\": \"input\", \"span_name\": \"$ROOT_SPAN_NAME\", \"attribute_path\": \"attributes.input.value\", \"fallback\": \"\"},
        {\"variable_name\": \"output\", \"span_name\": \"$ROOT_SPAN_NAME\", \"attribute_path\": \"attributes.output.value\", \"fallback\": \"\"}
      ]
    }
  }" \
  "$ARTHUR_ENGINE_URL/api/v1/tasks/$ARTHUR_TASK_ID/traces/transforms")

TRANSFORM_ID=$(echo "$TRANSFORM_RESPONSE" | python3 -c "import sys,json; print(json.load(sys.stdin).get('id',''))" 2>/dev/null)
```
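The transform request relies on `$ROOT_SPAN_NAME`, which this document never defines; presumably it comes from the trace fetched earlier. One way to derive it is sketched below, assuming the trace exposes a flat list of span dicts with `parent_span_id` and `span_name` keys. Both key names and the helper `find_root_span_name` are assumptions to be checked against the actual trace payload.

```python
def find_root_span_name(spans):
    """Return the name of the first span without a parent, or None if there is none.

    Assumes a flat list of dicts with 'parent_span_id' and 'span_name' keys;
    adjust the key names to match the actual trace payload.
    """
    for span in spans:
        if not span.get("parent_span_id"):
            return span.get("span_name")
    return None
```

If the helper returns `None`, it is probably safer to skip the transform than to post it with an empty span name.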
Then create a continuous eval for each LLM eval:
```bash
curl -s -X POST \
  -H "Authorization: Bearer $ARTHUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d "{
    \"name\": \"$DISPLAY_NAME\",
    \"llm_eval_name\": \"$SLUG\",
    \"llm_eval_version\": \"latest\",
    \"transform_id\": \"$TRANSFORM_ID\",
    \"transform_variable_mapping\": [
      {\"transform_variable\": \"input\", \"eval_variable\": \"input\"},
      {\"transform_variable\": \"output\", \"eval_variable\": \"output\"}
    ],
    \"enabled\": true
  }" \
  "$ARTHUR_ENGINE_URL/api/v1/tasks/$ARTHUR_TASK_ID/continuous_evals"
```
This step is non-blocking: if a request fails, log a warning and continue.
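The non-blocking behavior can be sketched as a loop that turns each failure into a warning rather than aborting. This is an illustrative pattern only: `create_all_non_blocking` is a hypothetical helper, and `create` stands for any caller-supplied function that performs the POST and raises on error.

```python
import logging

logger = logging.getLogger("arthur_onboard")


def create_all_non_blocking(evals, create):
    """Call create(spec) for each eval spec; log failures instead of raising.

    `create` is a caller-supplied (hypothetical) function that performs the
    request and raises on error. Returns (created, failed) lists of eval names.
    """
    created, failed = [], []
    for spec in evals:
        try:
            create(spec)
            created.append(spec["name"])
        except Exception as exc:  # non-blocking: any failure becomes a warning
            logger.warning("Could not create continuous eval %r: %s", spec["name"], exc)
            failed.append(spec["name"])
    return created, failed
```

Returning both lists lets the caller report which evals were skipped once onboarding finishes.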