skills/recipes/improve-setup/SKILL.md
Expert AI engineering consultant for your LangWatch setup. Audits your codebase, traces, evaluations, and scenarios, then guides you to improve — starting from low-hanging fruit and going deeper. Use when you want to level up your agent's engineering quality.
npx skillsauth add langwatch/langwatch improve-setup
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
This recipe acts as your expert AI engineering consultant. It audits everything, delivers quick fixes, then guides you deeper.
Before suggesting anything, read EVERYTHING:
git log --oneline -50: read commit messages for WHY things changed. Bug fixes reveal edge cases. Refactors reveal design decisions. These are goldmines for what to test and evaluate.
search_traces: check trace quality (inputs/outputs populated? spans connected? labels present?)
platform_list_scenarios: what scenarios exist? Are they comprehensive or shallow?
platform_list_evaluators: what evaluators are configured?
platform_list_prompts: are prompts versioned or hardcoded?
get_analytics: what's the cost, latency, error rate?
Based on the audit, identify:
Fix the easiest, highest-impact issues first:
debug-instrumentation recipe)
Deliver working results. Show the user what improved. This is the a-ha moment.
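As a rough illustration of "easiest, highest-impact first", the sketch below ranks audit findings by an impact-to-effort ratio. The `Finding` type, the 1-to-5 scales, and the example findings are all hypothetical and chosen only for illustration. They are not part of LangWatch or of this recipe.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    """One audit finding. Hypothetical shape, not a LangWatch schema."""
    name: str
    impact: int  # 1 (minor) to 5 (major), judged by whoever runs the audit
    effort: int  # 1 (minutes) to 5 (days), judged by whoever runs the audit


def quick_wins(findings: list[Finding], limit: int = 3) -> list[Finding]:
    """Return the findings with the best impact-to-effort ratio.

    Ties are broken by lower effort, then by name, so the order is stable.
    """
    ranked = sorted(
        findings,
        key=lambda f: (-f.impact / f.effort, f.effort, f.name),
    )
    return ranked[:limit]


if __name__ == "__main__":
    # Made-up findings, loosely based on the audit questions above.
    findings = [
        Finding("Traces have empty inputs/outputs", impact=5, effort=1),
        Finding("No scenario tests", impact=5, effort=3),
        Finding("Prompts hardcoded in source", impact=3, effort=2),
        Finding("No CI experiments", impact=4, effort=4),
    ]
    for f in quick_wins(findings):
        print(f"{f.name} (impact={f.impact}, effort={f.effort})")
```

A ratio is only one possible heuristic. Sorting by effort alone would surface trivial fixes that barely matter, which is why impact is part of the key here.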
After Phase 2, DON'T STOP. Suggest 2-3 specific improvements based on what you learned:
Domain-specific improvements: Based on the codebase domain, suggest targeted scenarios or evaluations. "I noticed your agent handles [X] — should I add edge case tests for [Y]?"
Expert involvement: If the domain is specialized (medical, financial, legal), suggest involving domain experts. "For healthcare scenarios, you'd benefit from a medical professional reviewing the compliance criteria — want me to draft scenarios they can review?"
Data quality: If using synthetic data, suggest real data. "Do you have real customer queries or support tickets? Those would make much better evaluation datasets."
CI/CD integration: If no CI pipeline, suggest adding experiments. "Want me to set up experiments that run in CI to catch regressions?"
Production monitoring: If no online evaluation, suggest monitors. "Your traces show no quality monitoring — want me to set up faithfulness checks on production traffic?"
Ask light questions with options. Don't overwhelm — pick the top 2-3 most impactful.
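To make the CI/CD suggestion above more concrete, here is a minimal sketch of a regression gate that compares current evaluation scores against a stored baseline. It assumes the scores are already available as plain dictionaries and that higher is better. How they are produced (for example, by an experiment run) is out of scope, and the metric names and the `find_regressions` helper are hypothetical.

```python
from typing import Optional


def find_regressions(
    baseline: dict[str, float],
    current: dict[str, float],
    tolerance: float = 0.02,
) -> dict[str, tuple[float, Optional[float]]]:
    """Return metrics that dropped more than `tolerance` below the baseline.

    Assumes higher scores are better. A metric that is missing from
    `current` is also reported, with None as its new score.
    """
    regressions: dict[str, tuple[float, Optional[float]]] = {}
    for metric, base_score in baseline.items():
        new_score = current.get(metric)
        if new_score is None or base_score - new_score > tolerance:
            regressions[metric] = (base_score, new_score)
    return regressions


if __name__ == "__main__":
    # Made-up scores for illustration only.
    baseline = {"faithfulness": 0.91, "answer_relevancy": 0.88}
    current = {"faithfulness": 0.84, "answer_relevancy": 0.88}
    for metric, (old, new) in find_regressions(baseline, current).items():
        print(f"REGRESSION {metric}: {old} -> {new}")
```

In a CI job, a non-empty result would typically be turned into a non-zero exit status so the pipeline fails when quality drops.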
After each improvement: