Adoption

Agent Skills are supported by leading AI development tools.

VS Code, Gemini CLI, GitHub, Goose, Amp, Cursor, Claude Code, Letta, OpenCode, Claude, OpenAI Codex, Factory

langwatch/improve-setup

Name: improve-setup
Author: langwatch

skills/recipes/improve-setup/SKILL.md

npx skillsauth add langwatch/langwatch improve-setup


Improve Your LangWatch Setup

This recipe acts as your expert AI engineering consultant. It audits everything, delivers quick fixes, then guides you deeper.

Phase 1: Full Audit

Before suggesting anything, read EVERYTHING:

Code Audit

Read the full codebase — every file, every function, every system prompt
Study git log --oneline -50 — read commit messages for WHY things changed. Bug fixes reveal edge cases. Refactors reveal design decisions. These are goldmines for what to test and evaluate.
Read README, docs, comments for domain context
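
The commit-history step above can be sketched as a small script. This is only an illustration: the keyword patterns and the function names `classify_commits` and `recent_commits` are assumptions, not part of the recipe, and commit subjects still need to be read by hand for the WHY.

```python
import re
import subprocess

# Keyword patterns are illustrative assumptions; tune them to the repo's conventions.
FIX_PATTERN = re.compile(r"\b(fix(es|ed)?|bug|hotfix|regression)\b", re.IGNORECASE)
REFACTOR_PATTERN = re.compile(r"\b(refactor(s|ed|ing)?|cleanup|restructure)\b", re.IGNORECASE)


def classify_commits(oneline_log: str) -> dict[str, list[str]]:
    """Group `git log --oneline` output into fix, refactor, and other subjects."""
    groups: dict[str, list[str]] = {"fix": [], "refactor": [], "other": []}
    for line in oneline_log.splitlines():
        line = line.strip()
        if not line:
            continue
        # Each line is "<short-hash> <subject>"; keep only the subject.
        _, _, subject = line.partition(" ")
        if FIX_PATTERN.search(subject):
            groups["fix"].append(subject)
        elif REFACTOR_PATTERN.search(subject):
            groups["refactor"].append(subject)
        else:
            groups["other"].append(subject)
    return groups


def recent_commits(limit: int = 50) -> str:
    """Return the last `limit` commits in one-line form (must run inside a git repo)."""
    result = subprocess.run(
        ["git", "log", "--oneline", f"-{limit}"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

Calling `classify_commits(recent_commits())` inside a repository gives a rough shortlist: the "fix" group hints at edge cases worth testing, and the "refactor" group hints at design decisions worth understanding.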

LangWatch Audit (via MCP)

Call search_traces — check trace quality (inputs/outputs populated? spans connected? labels present?)
Call platform_list_scenarios — what scenarios exist? Are they comprehensive or shallow?
Call platform_list_evaluators — what evaluators are configured?
Call platform_list_prompts — are prompts versioned or hardcoded?
Call get_analytics — what's the cost, latency, error rate?
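
The trace-quality questions above (inputs/outputs populated? spans connected? labels present?) could be turned into a quick summary like the one below. The trace shape used here (keys `input`, `output`, `spans`, `labels`, `span_id`, `parent_id`) is a hypothetical stand-in; the actual data returned by search_traces may be structured differently and would need to be mapped first.

```python
from typing import Any


def spans_connected(spans: list[dict[str, Any]]) -> bool:
    """True when there is exactly one root span and every other span has a known parent."""
    ids = {span["span_id"] for span in spans}
    roots = [s for s in spans if s.get("parent_id") is None]
    orphans = [
        s for s in spans
        if s.get("parent_id") is not None and s["parent_id"] not in ids
    ]
    return len(roots) == 1 and not orphans


def trace_quality(traces: list[dict[str, Any]]) -> dict[str, float]:
    """Return the fraction of traces that pass each of the three quality checks."""
    total = len(traces)
    if total == 0:
        return {"populated_io": 0.0, "connected_spans": 0.0, "labelled": 0.0}
    populated = sum(1 for t in traces if t.get("input") and t.get("output"))
    connected = sum(1 for t in traces if spans_connected(t.get("spans", [])))
    labelled = sum(1 for t in traces if t.get("labels"))
    return {
        "populated_io": populated / total,
        "connected_spans": connected / total,
        "labelled": labelled / total,
    }
```

Low fractions on any of the three keys point at the broken instrumentation that Phase 2 addresses first.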

Gap Analysis

Based on the audit, identify:

What's missing entirely (no scenarios? no evaluations? no prompt versioning?)
What exists but is weak (generic datasets? shallow scenarios? broken traces?)
What's working well (keep and build on)

Phase 2: Low-Hanging Fruit

Fix the easiest, highest-impact issues first:

Broken instrumentation → fix traces (see debug-instrumentation recipe)
Hardcoded prompts → set up prompt versioning
No tests at all → create initial scenario tests
Generic datasets → generate domain-specific ones

Deliver working results. Show the user what improved. This is the a-ha moment.
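
One way to make "easiest, highest-impact first" concrete is to score each audit finding and sort by impact per unit of effort. The `Finding` type, the 1 to 5 scales, and `quick_wins` are assumptions made for this sketch; the recipe itself does not prescribe a scoring scheme.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    issue: str
    fix: str
    impact: int  # 1 (low) to 5 (high), judged during the audit
    effort: int  # 1 (easy) to 5 (hard), judged during the audit


def quick_wins(findings: list[Finding], limit: int = 3) -> list[Finding]:
    """Return up to `limit` findings, best impact-to-effort ratio first."""
    # Ties on the ratio are broken in favour of the lower-effort fix.
    ranked = sorted(findings, key=lambda f: (-f.impact / f.effort, f.effort))
    return ranked[:limit]


findings = [
    Finding("Broken instrumentation", "fix traces", impact=5, effort=2),
    Finding("Hardcoded prompts", "set up prompt versioning", impact=3, effort=1),
    Finding("No tests at all", "create initial scenario tests", impact=5, effort=3),
    Finding("Generic datasets", "generate domain-specific ones", impact=2, effort=2),
]

for finding in quick_wins(findings, limit=2):
    print(f"{finding.issue} -> {finding.fix}")
```

The scores in the example list are made up; in practice they come from what the Phase 1 audit found in this particular codebase.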

Phase 3: Guide Deeper

After Phase 2, DON'T STOP. Suggest 2-3 specific improvements based on what you learned:

Domain-specific improvements: Based on the codebase domain, suggest targeted scenarios or evaluations. "I noticed your agent handles [X] — should I add edge case tests for [Y]?"
Expert involvement: If the domain is specialized (medical, financial, legal), suggest involving domain experts. "For healthcare scenarios, you'd benefit from a medical professional reviewing the compliance criteria — want me to draft scenarios they can review?"
Data quality: If using synthetic data, suggest real data. "Do you have real customer queries or support tickets? Those would make much better evaluation datasets."
CI/CD integration: If no CI pipeline, suggest adding experiments. "Want me to set up experiments that run in CI to catch regressions?"
Production monitoring: If no online evaluation, suggest monitors. "Your traces show no quality monitoring — want me to set up faithfulness checks on production traffic?"

Ask light questions with options. Don't overwhelm — pick the top 2-3 most impactful.
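
The CI/CD suggestion above amounts to a regression gate: compare the current experiment results against a stored baseline and fail the build when quality drops. The sketch below assumes results are reduced to pass/fail booleans and uses a hypothetical tolerance of 0.02; it does not call any LangWatch API.

```python
def pass_rate(results: list[bool]) -> float:
    """Fraction of passing results; raises if there is nothing to measure."""
    if not results:
        raise ValueError("no results to evaluate")
    return sum(results) / len(results)


def has_regressed(baseline: float, current: float, tolerance: float = 0.02) -> bool:
    """True when the current pass rate falls below the baseline by more than `tolerance`."""
    return current < baseline - tolerance
```

In a CI job, a script would compute `pass_rate` from the latest run and exit with a non-zero status when `has_regressed` returns True, which blocks the merge until the regression is understood.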

Phase 4: Keep Iterating

After each improvement:

Show what was accomplished
Run any tests to verify
Ask what to tackle next
Stop when the user says "that's enough"

Common Mistakes

Do NOT skip the audit — you can't suggest improvements without understanding the current state
Do NOT give generic advice — every suggestion must be specific to this codebase
Do NOT overwhelm with 10 suggestions — pick the top 2-3
Do NOT skip running/verifying improvements


Expert AI engineering consultant for your LangWatch setup. Audits your codebase, traces, evaluations, and scenarios, then guides you to improve — starting from low-hanging fruit and going deeper. Use when you want to level up your agent's engineering quality.

3,203 stars

development

Updated Apr 15, 2026


npx skillsauth add langwatch/langwatch improve-setup

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.


Trivy (container and dependency vulnerability scanner): Clean, 95%
Semgrep (static code analysis for vulnerabilities): Clean, 95%
mcp-scan (Snyk) (Model Context Protocol security validation): Clean, 95%
Snyk (dep) (open source security scanning): Skipped, 50%
Socket.dev (supply chain security analysis): Skipped, 50%
VirusTotal (multi-engine malware detection): Skipped, 50%
CrowdStrike (advanced threat intelligence): Skipped, 50%
OSV-Scanner (Open Source Vulnerability database check): Skipped, 50%
OWASP Dep-Check: Skipped, 50%

Last scanned: Apr 24, 2026, 8:47 PM (1.7s, 1 file scanned)

SKILL.md

name: improve-setup
description: Expert AI engineering consultant for your LangWatch setup. Audits your codebase, traces, evaluations, and scenarios, then guides you to improve — starting from low-hanging fruit and going deeper. Use when you want to level up your agent's engineering quality.
license: MIT
compatibility: Requires LangWatch MCP with API key. Works with Claude Code and similar coding agents.
category: recipe


Related Skills

langwatch/tracing

development

Verified, Trusted, Community

Add LangWatch tracing and observability to your code. Use for both onboarding (instrument an entire codebase) and targeted operations (add tracing to a specific function or module). Supports Python and TypeScript with all major frameworks.

3,203 stars, SKILL.md, updated Apr 15, 2026

langwatch/scenarios

tools

Verified, Trusted, Community

Test your AI agent with simulation-based scenarios. Covers writing scenario test code (Scenario SDK), creating platform scenarios (CLI or MCP), and red teaming for security vulnerabilities. Auto-detects whether to use code or platform approach based on context.

3,203 stars, SKILL.md, updated Apr 15, 2026

langwatch/test-compliance

testing

Verified, Trusted, Community

Test that your AI agent stays observational and doesn't give prescriptive advice in regulated domains (healthcare, finance, legal). Creates scenario tests for boundary enforcement and red team tests for adversarial probing. Use when your agent advises but must not prescribe.

3,203 stars, SKILL.md, updated Apr 15, 2026

langwatch/test-cli-usability

tools

Verified, Trusted, Community

Write scenario tests that verify your CLI tool is usable by AI agents. Ensures commands work non-interactively, provide clear output, and don't hang on prompts. Use when you want to prove your CLI is agent-friendly.

3,203 stars, SKILL.md, updated Apr 15, 2026

Download

For Claude Desktop. Download once, then upload the file in the app — no terminal needed.


Install manually


# Clone the repo
git clone https://github.com/langwatch/langwatch.git

# Make sure the global Claude Code skills folder exists
mkdir -p ~/.claude/skills

# Copy into Claude Code skills folder (global)
cp -r langwatch/skills/recipes/improve-setup ~/.claude/skills/

Claude Code Skills — official skills path docs.

Repository

langwatch/langwatch

3,203 stars

Compatible with

Claude Code

OpenAI Codex CLI

ChatGPT