lebsral/dspy-best-of-n

Name: dspy-best-of-n
Author: lebsral

skills/dspy-best-of-n/SKILL.md

npx skillsauth add lebsral/dspy-programming-not-prompting-lms-skills dspy-best-of-n

Pick the Best Output with dspy.BestOfN

Guide the user through using DSPy's BestOfN module to run a program multiple times and keep the highest-scoring result. This is rejection sampling -- generate N candidates, score each one, return the winner.

What is BestOfN

dspy.BestOfN wraps any DSPy module and calls it up to N times with temperature=1.0 (each attempt uses a different rollout ID to get diverse outputs). A reward function scores every result, and BestOfN returns the single best prediction.

If any attempt hits a score threshold you set, execution stops early -- no need to burn through all N attempts when you already have a great result.

Your module ──> Run N times ──> Score each with reward_fn ──> Return best

When to use BestOfN

You have a cheap, fast metric that can score outputs automatically (test suite passes, regex match, word count check, etc.)
Quality variance is high -- the same prompt sometimes produces great output and sometimes doesn't
You'd rather spend tokens than engineering time -- BestOfN is the simplest way to boost quality without optimization
You need a quick quality boost before investing in full prompt optimization with MIPROv2 or BootstrapFewShot

Do not use BestOfN when:

You have no way to automatically score outputs (you need a metric)
Latency matters more than quality (N calls take N times longer, unless you can parallelize)
Cost is a hard constraint and N is large

Basic usage

import dspy

lm = dspy.LM("openai/gpt-4o-mini")  # or "anthropic/claude-sonnet-4-5-20250929", etc.
dspy.configure(lm=lm)

# 1. Define your module
qa = dspy.ChainOfThought("question -> answer")

# 2. Define a reward function
def short_answer(args, pred):
    """Prefer concise single-word answers."""
    return 1.0 if len(pred.answer.split()) == 1 else 0.0

# 3. Wrap with BestOfN
best_qa = dspy.BestOfN(
    module=qa,
    N=3,
    reward_fn=short_answer,
    threshold=1.0,
)

# 4. Call it like any module
result = best_qa(question="What is the capital of Belgium?")
print(result.answer)

Constructor parameters

dspy.BestOfN(
    module,       # Any dspy.Module to run repeatedly
    N,            # Number of attempts (int)
    reward_fn,    # Scoring function: (args_dict, prediction) -> float
    threshold,    # Early-stop threshold: stop as soon as a score >= threshold
    fail_count=None,  # Max failures before raising an error (defaults to N)
)

| Parameter | Type | Description |
|-----------|------|-------------|
| module | dspy.Module | The module to run N times |
| N | int | Maximum number of attempts |
| reward_fn | Callable[[dict, Prediction], float] | Scores each prediction; higher is better |
| threshold | float | If any attempt scores >= this value, return immediately |
| fail_count | int \| None | How many attempts can fail (raise exceptions) before BestOfN itself raises. Defaults to N (all can fail before error) |

The reward function

The reward function is the core of BestOfN. It receives two arguments:

def reward_fn(args: dict, prediction: dspy.Prediction) -> float:
    # args: the keyword arguments you passed to the BestOfN call
    # prediction: the output from one attempt of the wrapped module
    # Return: a scalar score (higher = better)
    ...

Key differences from a dspy.Evaluate metric:

Signature: (args_dict, prediction) not (example, prediction, trace)
No gold labels: args contains only the inputs you passed, not expected outputs
No trace parameter: BestOfN doesn't use traces
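If an existing Evaluate-style metric only inspects the inputs and the prediction (no gold labels), it can be adapted rather than rewritten. The following is a minimal sketch, assuming a hypothetical helper named as_reward_fn and an illustrative metric named concise_metric; neither is part of DSPy, and SimpleNamespace stands in for a dspy.Prediction so the snippet runs offline.

```python
from types import SimpleNamespace


def concise_metric(example, pred, trace=None):
    """Evaluate-style metric (illustrative): 1.0 if the answer is at most three words."""
    return 1.0 if len(pred.answer.split()) <= 3 else 0.0


def as_reward_fn(metric):
    """Hypothetical adapter: wrap an (example, pred, trace) metric as (args, pred)."""
    def reward_fn(args, pred):
        # Expose the call's keyword arguments as attributes, like an example's inputs.
        example = SimpleNamespace(**args)
        return float(metric(example, pred, None))
    return reward_fn


reward = as_reward_fn(concise_metric)
pred = SimpleNamespace(answer="Brussels")  # stand-in for a dspy.Prediction
print(reward({"question": "What is the capital of Belgium?"}, pred))
```

This only works for metrics that never read expected outputs; a metric that compares against example.answer has nothing to compare with here, because args holds inputs only.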

Reward function examples

Binary pass/fail:

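def passes_tests(args, pred):
    """Score 1.0 if the generated code (and any assertions in it) runs without raising, 0.0 otherwise."""
    try:
        # Run in a fresh namespace so generated code cannot touch this module's globals.
        exec(pred.code, {})
        return 1.0
    except Exception:
        return 0.0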
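Running generated code in the current process can hang or corrupt state, so a variant that runs it in a child interpreter with a timeout may be preferable. This is a sketch under stated assumptions: the function name passes_tests_isolated is hypothetical, and it assumes the caller passes test assertions as a string under args["tests"], which the original example does not do. A subprocess is process isolation only, not a security sandbox.

```python
import subprocess
import sys
from types import SimpleNamespace


def passes_tests_isolated(args, pred, timeout_s=5.0):
    """Return 1.0 if pred.code plus the caller's assertions exit cleanly, else 0.0."""
    # Assumption: test assertions arrive as source text in args["tests"].
    program = pred.code + "\n" + args["tests"]
    try:
        completed = subprocess.run(
            [sys.executable, "-c", program],
            capture_output=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return 0.0  # treat a hang as a failed candidate
    return 1.0 if completed.returncode == 0 else 0.0


good = SimpleNamespace(code="def add(a, b):\n    return a + b")
bad = SimpleNamespace(code="def add(a, b):\n    return a - b")
call_args = {"tests": "assert add(2, 3) == 5"}
print(passes_tests_isolated(call_args, good), passes_tests_isolated(call_args, bad))
```

Because timeout_s has a default, the function still matches the two-argument (args, prediction) shape that a reward function is called with.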

Graded score:

def quality_score(args, pred):
    """Score summaries on length and keyword coverage."""
    score = 0.0
    # Prefer summaries under 100 words
    if len(pred.summary.split()) <= 100:
        score += 0.5
    # Reward covering key topics
    keywords = ["revenue", "growth", "forecast"]
    covered = sum(1 for kw in keywords if kw in pred.summary.lower())
    score += 0.5 * (covered / len(keywords))
    return score

Using an LM as judge inside the reward:

class JudgeQuality(dspy.Signature):
    """Rate the answer quality from 0.0 to 1.0."""
    question: str = dspy.InputField()
    answer: str = dspy.InputField()
    score: float = dspy.OutputField(desc="Quality score from 0.0 to 1.0")

judge = dspy.Predict(JudgeQuality)

def llm_reward(args, pred):
    result = judge(question=args["question"], answer=pred.answer)
    return result.score

Note: Using an LM as judge inside the reward function costs additional tokens per attempt. Reserve this for cases where programmatic scoring isn't feasible.
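The 0.0 to 1.0 range in the judge's field description is an instruction to the model, and nothing in the snippet above enforces it. A small guard can keep an out-of-range or unparsable score from distorting selection. This is a sketch only; clamp_score is a hypothetical helper, not a DSPy function.

```python
import math


def clamp_score(raw, default=0.0):
    """Coerce a judge's output to a float in [0.0, 1.0]; fall back to `default`."""
    try:
        value = float(raw)
    except (TypeError, ValueError):
        return default  # not a number at all
    if math.isnan(value):
        return default
    return min(1.0, max(0.0, value))


print(clamp_score("0.7"), clamp_score(1.4), clamp_score(None))
```

Inside llm_reward, the last line would then read return clamp_score(result.score).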

Tuning N

| N | Trade-off |
|---|-----------|
| 2-3 | Low cost, modest quality gain. Good starting point. |
| 5 | Solid improvement for tasks with high variance. Sweet spot for most uses. |
| 10+ | Diminishing returns unless your metric is very selective (e.g., <10% pass rate). |

Rule of thumb: if your base module succeeds ~50% of the time, N=3 gives you a ~87.5% chance of at least one success. If it succeeds ~20% of the time, you need N=8 for ~83%.

The math: probability of at least one success in N tries = 1 - (1 - p)^N where p is the single-attempt success rate.
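The formula is easy to check numerically. The sketch below assumes attempts are independent with the same success rate p, which is an idealization; the function names are illustrative.

```python
import math


def p_at_least_one(p, n):
    """Probability of at least one success in n independent attempts."""
    return 1.0 - (1.0 - p) ** n


def attempts_needed(p, target):
    """Smallest n with p_at_least_one(p, n) >= target, for 0 < p < 1 and 0 < target < 1."""
    return math.ceil(math.log(1.0 - target) / math.log(1.0 - p))


print(round(p_at_least_one(0.5, 3), 3))  # 0.875
print(round(p_at_least_one(0.2, 8), 3))  # 0.832
print(attempts_needed(0.2, 0.83))        # 8
```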

How selection works internally

BestOfN calls your module with temperature=1.0 and a unique rollout ID for each attempt
Each attempt produces a dspy.Prediction
The reward function scores the prediction
If the score >= threshold, return immediately (early stopping)
If the attempt raises an exception, increment the failure counter; if failures exceed fail_count, raise the exception instead of continuing
After all N attempts (or early stopping), return the prediction with the highest score

The unique rollout IDs ensure the LM produces diverse outputs even with the same input. Temperature is fixed at 1.0 to maximize diversity.
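The steps above can be sketched in plain Python. This is a simplified illustration of the selection loop, not DSPy's implementation: best_of_n, guess_capital, and the use of the attempt index as a stand-in for a rollout ID are all assumptions made for the example, and the library's failure bookkeeping may differ in detail.

```python
def best_of_n(run_once, reward_fn, args, n, threshold, fail_count=None):
    """Call run_once up to n times and return (best_output, best_reward)."""
    failures_allowed = n if fail_count is None else fail_count
    failures = 0
    best_output, best_reward = None, float("-inf")
    for attempt in range(n):
        try:
            # The attempt index stands in for a unique rollout ID.
            output = run_once(attempt, **args)
            reward = reward_fn(args, output)
        except Exception:
            failures += 1
            if failures > failures_allowed:
                raise
            continue
        if reward > best_reward:
            best_output, best_reward = output, reward
        if reward >= threshold:
            break  # early stop: this candidate is good enough
    return best_output, best_reward


def guess_capital(attempt, question):
    """Toy stand-in for a module: its output varies with the attempt index."""
    candidates = ["The capital is Brussels", "Brussels", "It is Brussels"]
    return candidates[attempt % len(candidates)]


def short_answer(args, output):
    return 1.0 if len(output.split()) == 1 else 0.0


result = best_of_n(guess_capital, short_answer,
                   {"question": "What is the capital of Belgium?"},
                   n=3, threshold=1.0)
print(result)  # ('Brussels', 1.0)
```

The second attempt reaches the threshold, so the third is never run.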

Cost considerations

BestOfN multiplies your token usage by up to N times (fewer if early stopping kicks in). Budget accordingly:

| Base cost per call | N | Max cost |
|--------------------|---|----------|
| $0.01 | 3 | $0.03 |
| $0.01 | 5 | $0.05 |
| $0.01 | 10 | $0.10 |

Ways to manage cost:

Set a tight threshold so good results stop early (often after 1-2 attempts)
Use a cheap model as the base module and a stronger model only for the reward function
Start with N=3 and increase only if your metric shows it helps
Use programmatic reward functions (regex, test execution, length checks) instead of LM-based judges to avoid extra LM calls per attempt
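The table shows the worst case. With early stopping, the expected spend is lower: attempt k+1 only runs if the first k attempts all missed the threshold, so the expected number of attempts is the sum of (1 - p)^k for k from 0 to N-1, which equals (1 - (1 - p)^N) / p. The sketch below assumes independent attempts that each reach the threshold with probability p and ignores any reward-function cost; the function names are illustrative.

```python
def expected_attempts(p, n):
    """Expected attempts when each one independently reaches the threshold with probability p."""
    if p <= 0.0:
        return float(n)  # threshold never reached: every attempt runs
    return (1.0 - (1.0 - p) ** n) / p


def expected_cost(base_cost, p, n):
    """Expected spend per call, ignoring reward-function cost."""
    return base_cost * expected_attempts(p, n)


print(expected_attempts(0.5, 3))               # 1.75
print(round(expected_cost(0.01, 0.2, 10), 4))  # 0.0446
```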

BestOfN vs MultiChainComparison

Both BestOfN and dspy.MultiChainComparison aim to pick the best output from multiple candidates, but they work differently:

| | BestOfN | MultiChainComparison |
|---|---------|---------------------|
| Selection method | Your reward function scores each candidate | An LM reads all candidates and picks the best |
| Metric required | Yes -- you must provide a reward_fn | No -- the LM decides what "best" means |
| Token cost | N calls to your module (+ reward fn) | Multiple chain calls + one comparison call |
| Best when | You have a clear, automatable scoring criterion | Quality is subjective or hard to score programmatically |
| Optimizable | The wrapped module can be optimized | The comparison module can be optimized |

Use BestOfN when you can write a reward function. Use MultiChainComparison when you want the LM to judge quality using its own understanding.

Combining BestOfN with optimization

BestOfN works well as a complement to DSPy optimizers. Optimize your module first, then wrap the optimized version with BestOfN for an additional quality boost:

# Optimize the base module
optimizer = dspy.BootstrapFewShot(metric=metric, max_bootstrapped_demos=4)
optimized_qa = optimizer.compile(qa, trainset=trainset)

# Wrap the optimized module with BestOfN
best_qa = dspy.BestOfN(
    module=optimized_qa,
    N=3,
    reward_fn=my_reward,
    threshold=1.0,
)

This stacks two quality improvements: better prompts from the optimizer, and rejection sampling from BestOfN.

Gotchas

Claude writes the reward function with (example, prediction, trace=None) signature. BestOfN reward functions take (args_dict, prediction), not the (example, prediction, trace) signature used by dspy.Evaluate metrics. The args dict contains only the inputs you passed to the call, not labeled examples with gold outputs.
Claude sets N too high without considering cost. Each attempt is a full LM call at temperature=1.0, so N=10 means up to 10x the token cost. Start with N=3 and increase only if your metric shows improvement; diminishing returns kick in quickly above N=5.
Claude uses BestOfN when the reward function is as expensive as the module itself. If your reward function calls an LM (e.g., LM-as-judge), each BestOfN attempt makes two LM calls (one for the module, one for the judge). For N=5, that is up to 10 LM calls total. Use programmatic reward functions (test execution, regex, length checks) whenever possible.
Claude forgets to set threshold to enable early stopping. Without a meaningful threshold, BestOfN always runs all N attempts even when the first one is perfect. Set threshold to a value that represents "good enough" (e.g., 1.0 for binary pass/fail, 0.9 for graded metrics) to save tokens on easy inputs.
Claude wraps an already-optimized module but does not evaluate the incremental gain. BestOfN on top of an optimized module costs N times more per call at inference time. Always measure the quality gain from BestOfN separately to confirm the extra cost is justified — if the optimized module already hits 95%+, BestOfN may not add enough to be worth it.
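Measuring the incremental gain can be as simple as averaging the same reward over a fixed set of inputs for both programs. The sketch below uses stub callables so it runs offline; mean_reward, base_program, and wrapped_program are hypothetical names. With DSPy, the two programs would be the optimized module and its BestOfN wrapper, and the outputs would be dspy.Prediction objects.

```python
from statistics import mean


def mean_reward(program, inputs, reward_fn):
    """Average reward of `program` over a list of keyword-argument dicts."""
    return mean(reward_fn(args, program(**args)) for args in inputs)


def base_program(question):
    """Stub for the optimized module alone."""
    return "The answer is 4"


def wrapped_program(question):
    """Stub for the same module wrapped in BestOfN."""
    return "4"


def short_answer(args, output):
    return 1.0 if len(output.split()) == 1 else 0.0


inputs = [{"question": "2 + 2?"}, {"question": "1 + 3?"}]
gain = (mean_reward(wrapped_program, inputs, short_answer)
        - mean_reward(base_program, inputs, short_answer))
print(gain)
```

A gain that is small relative to the extra cost per call is a signal to drop the wrapper or lower N.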

Additional resources

dspy.BestOfN API docs
reference.md — constructor parameters, forward() method, key behaviors
examples.md — code generation with test-based selection, summarization with graded metric

Cross-references

Install any skill: npx skills add lebsral/DSPy-Programming-not-prompting-LMs-skills --skill <name>

MultiChainComparison for LM-based candidate selection -- see /dspy-multi-chain-comparison
Evaluate for measuring quality with metrics and devsets -- see /dspy-evaluate
Improving accuracy for the full optimization workflow -- see /ai-improving-accuracy
Install /ai-do if you do not have it — it routes any AI problem to the right skill and is the fastest way to work: npx skills add lebsral/DSPy-Programming-not-prompting-LMs-skills --skill ai-do

Use when output quality varies across runs and you want to sample multiple completions and pick the best — trading latency for reliability on high-stakes outputs. Common scenarios - generating multiple candidate answers and picking the highest-scoring one, improving reliability on high-stakes classification, reducing variance in creative generation, getting better summaries by sampling several and selecting the best, or trading latency for quality on critical decisions. Related - ai-improving-accuracy, ai-making-consistent. Also used for sample multiple completions, pick the best of several LLM outputs, majority voting for LLM, self-consistency decoding, reduce LLM output variance, generate and select pattern, best candidate selection, how to make AI more reliable by trying multiple times, brute force better quality, retry and pick best, dspy.BestOfN, quality vs latency tradeoff, n=5 completions pick best.

5 stars

testing

Updated May 5, 2026

npx skillsauth add lebsral/dspy-programming-not-prompting-lms-skills dspy-best-of-n

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

| Scanner | Description | Status |
|---------|-------------|--------|
| Trivy | Container and dependency vulnerability scanner | Clean (95%) |
| Semgrep | Static code analysis for vulnerabilities | Clean (95%) |
| mcp-scan (Snyk) | Model Context Protocol security validation | Clean (95%) |
| Snyk (dep) | Open source security scanning | Skipped (50%) |
| Socket.dev | Supply chain security analysis | Skipped (50%) |
| VirusTotal | Multi-engine malware detection | Skipped (50%) |
| CrowdStrike | Advanced threat intelligence | Skipped (50%) |
| OSV-Scanner | Open Source Vulnerability database check | Skipped (50%) |
| OWASP Dep-Check | | Skipped (50%) |

Last scanned: May 5, 2026, 8:01 AM · 195.3s · 4 files scanned

Install manually

# Clone the repo
git clone https://github.com/lebsral/dspy-programming-not-prompting-lms-skills.git

# Copy into Claude Code skills folder (global)
cp -r dspy-programming-not-prompting-lms-skills/skills/dspy-best-of-n ~/.claude/skills/

Claude Code Skills — official skills path docs.

Repository

lebsral/dspy-programming-not-prompting-lms-skills

5 stars

Compatible with

Claude Code

OpenAI Codex CLI

ChatGPT