skills/darl-encouraging-diverse-answers/SKILL.md
Generate diverse, high-quality answer variants for open-ended tasks using DARL's bounded-diversity framework. Use when: 'generate multiple solutions', 'diverse answers to this prompt', 'brainstorm varied approaches', 'explore alternative implementations', 'rewrite this N different ways', 'what are different ways to solve this'.
Install this skill globally with one command: `npx skillsauth add ndpvt-web/arxiv-claude-skills darl-encouraging-diverse-answers`. Works with Claude Code, Cursor, and Windsurf.
This skill applies the DARL (Diverse Answers via Reinforcement Learning) framework to generate multiple high-quality, meaningfully distinct solutions to a problem while keeping each variant within a controlled deviation range from a reference answer. DARL's core insight is to reward diversity only when it stays within a confidence-scaled bound around the reference. The result avoids both near-duplicate paraphrases and wildly divergent outputs: answers differ genuinely in approach but remain equivalently correct.
The Overfitting Problem. Standard approaches to generating multiple answers tend to cluster around a single "best" reference. If you ask for five implementations, you often get the same logic with superficial variable renaming. RLPR-style methods that train against reference answers exacerbate this by rewarding proximity to one canonical solution. DARL addresses this by adding a bounded diversity incentive.
Bounded Diversity Reward. DARL modifies the reward signal with a two-part formula: r_total = alpha * r_reference + beta * delta_r * I[delta_r <= r_reference / gamma]. Here, r_reference measures alignment with the reference answer, delta_r measures how much a candidate deviates from the reference, and the indicator function I[...] gates the diversity bonus so it applies only when the deviation is at most r_reference / gamma. alpha weights the reference term, beta weights the diversity bonus (default 0.01), and gamma sets the exploration bound (typically 8-10; a larger gamma means a tighter bound). In effect, small, controlled deviations from the reference are rewarded; large deviations that risk correctness earn no bonus.
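The formula above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: it assumes both scores are plain floats, and the `alpha=1.0` default is an assumption, since only the beta and gamma defaults are stated above.

```python
def darl_reward(
    r_reference: float,
    delta_r: float,
    alpha: float = 1.0,   # assumed default; the text gives no value for alpha
    beta: float = 0.01,   # diversity weight from the text
    gamma: float = 8.0,   # exploration bound, low end of the 8-10 range
) -> float:
    """r_total = alpha * r_reference + beta * delta_r * I[delta_r <= r_reference / gamma]."""
    within_bound = delta_r <= r_reference / gamma
    diversity_bonus = beta * delta_r if within_bound else 0.0
    return alpha * r_reference + diversity_bonus


# A small deviation (0.05 <= 0.8 / 8) earns the bonus; a large one (0.5) does not.
print(darl_reward(r_reference=0.8, delta_r=0.05))
print(darl_reward(r_reference=0.8, delta_r=0.5))
```

Because the indicator zeroes the bonus instead of subtracting anything, a candidate outside the bound is scored on its reference alignment alone.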
Why This Works for Code Generation. When generating diverse code solutions, the reference answer acts as a quality anchor. Variants that change algorithm choice, data structure, or control flow stay within the bound and are rewarded, while variants that introduce bugs or miss requirements exceed it and earn no diversity bonus. Because the bound is r_reference / gamma, it scales with confidence in the reference: high-confidence problems (simple utility functions) allow wider exploration, while low-confidence problems (complex algorithms) constrain diversity to safer variants.
Establish the reference solution. Solve the problem once with your best approach. This becomes y*, the reference against which diversity is measured. Ensure it is correct and complete before proceeding.
Identify diversity axes. Analyze the reference to determine which dimensions can vary meaningfully: algorithm choice, data structure, API style (callback vs. promise vs. async/await), error handling strategy, naming conventions, architectural pattern, or level of abstraction.
Set the deviation bound. The allowed deviation is r_reference / gamma, so a larger gamma is stricter. For well-defined problems (math, algorithms with known solutions), use a tight bound (gamma ~ 10): variants should differ in implementation but converge on the same result. For open-ended tasks (writing, architecture, UI), use a loose bound (gamma ~ 6-8): variants can differ more substantially in structure and style.
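As a rough sketch of this step, the snippet below maps a task type to a gamma value and checks whether a deviation fits inside the resulting bound. The task-type labels and the choice of 7.0 as a midpoint of the 6-8 range are assumptions made for illustration.

```python
# Assumed mapping: 10.0 for well-defined tasks, 7.0 as a midpoint of the 6-8 range.
GAMMA_BY_TASK = {"well_defined": 10.0, "open_ended": 7.0}


def deviation_bound(r_reference: float, task_type: str) -> float:
    """Largest deviation that still earns a diversity bonus."""
    return r_reference / GAMMA_BY_TASK[task_type]


def within_bound(delta_r: float, r_reference: float, task_type: str) -> bool:
    return delta_r <= deviation_bound(r_reference, task_type)


# The same deviation can be acceptable for an open-ended task
# and too large for a well-defined one.
print(within_bound(0.1, r_reference=0.9, task_type="open_ended"))
print(within_bound(0.1, r_reference=0.9, task_type="well_defined"))
```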
Generate candidate variants. For each variant, change one or two diversity axes relative to the reference and hold the rest fixed. Produce N candidates (typically 3-5), each exploring a different axis. Do not try to maximize difference -- aim for meaningful, purposeful divergence.
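One way to plan the candidates is to enumerate which axes each variant will change before writing any code. The helper below is a hypothetical sketch: `axis_plans` and the axis names are illustrative. It lists single-axis plans first and falls back to two-axis combinations only when more variants are requested.

```python
from itertools import combinations


def axis_plans(axes: list[str], n_variants: int) -> list[tuple[str, ...]]:
    """Return up to n_variants plans, each naming the one or two axes to change."""
    singles = [(axis,) for axis in axes]
    pairs = list(combinations(axes, 2))
    return (singles + pairs)[:n_variants]


for plan in axis_plans(["algorithm", "data structure", "error handling"], n_variants=4):
    print("change:", ", ".join(plan))
```

Listing single-axis plans first keeps most variants one step away from the reference, which makes the later tradeoff annotations easier to attribute to a single change.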
Apply the bounded filter. For each candidate, assess: does this variant still satisfy all the original requirements? If it introduces a correctness regression (fails a test case the reference passes, omits a requirement, introduces a vulnerability), it exceeds the deviation bound and should be discarded or revised.
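For code variants, the filter can be approximated by running every candidate against the test cases the reference passes. The sketch below assumes candidates are plain callables and that test cases are `(args, expected)` pairs; the names `passes_all` and `bounded_filter` are hypothetical.

```python
from typing import Callable

Case = tuple[tuple, object]  # (positional args, expected result)


def passes_all(func: Callable, cases: list[Case]) -> bool:
    """True if func returns the expected value for every case without raising."""
    for args, expected in cases:
        try:
            if func(*args) != expected:
                return False
        except Exception:
            return False
    return True


def bounded_filter(reference: Callable, candidates: dict[str, Callable],
                   cases: list[Case]) -> dict[str, Callable]:
    """Keep only candidates that pass every case the reference passes."""
    if not passes_all(reference, cases):
        raise ValueError("reference fails its own test cases")
    return {name: f for name, f in candidates.items() if passes_all(f, cases)}


def reference(xs):
    return sum(x * x for x in xs)


def loop_variant(xs):
    total = 0
    for x in xs:
        total += x * x
    return total


def buggy_variant(xs):
    return sum(x * 2 for x in xs)  # doubles instead of squaring


CASES = [(([1, 2, 3],), 14), (([],), 0)]
kept = bounded_filter(reference, {"loop": loop_variant, "buggy": buggy_variant}, CASES)
print(sorted(kept))
```

Tests only cover functional regressions; omitted requirements and vulnerabilities still need a manual or tool-assisted review.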
Score and rank candidates. Rank remaining variants by the combined metric: quality (does it work as well as the reference?) plus diversity bonus (how different is it from the reference and from other candidates?). Prefer a set of variants that are mutually diverse, not just diverse from the reference.
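A greedy selection is one simple way to favor mutually diverse variants. The sketch below is an illustration under stated assumptions: it uses `difflib.SequenceMatcher` text distance as a stand-in for a real diversity measure, takes quality scores as given, and reuses the 0.01 diversity weight from the reward formula.

```python
from difflib import SequenceMatcher


def distance(a: str, b: str) -> float:
    """Textual distance in [0, 1]; an assumed proxy for how different two variants are."""
    return 1.0 - SequenceMatcher(None, a, b).ratio()


def select_diverse(reference: str, candidates: dict[str, str],
                   quality: dict[str, float], k: int, beta: float = 0.01) -> list[str]:
    """Greedily pick k names, scoring quality plus novelty against everything already kept."""
    chosen: list[str] = []
    remaining = sorted(candidates)  # sorted for deterministic tie-breaking
    while remaining and len(chosen) < k:
        anchors = [reference] + [candidates[name] for name in chosen]

        def score(name: str) -> float:
            novelty = min(distance(candidates[name], anchor) for anchor in anchors)
            return quality[name] + beta * novelty

        best = max(remaining, key=score)
        chosen.append(best)
        remaining.remove(best)
    return chosen


reference = "for x in items: total += x"
candidates = {
    "rename": "for y in items: total += y",   # superficial variable rename
    "builtin": "total = sum(items)",          # different approach
}
quality = {"rename": 1.0, "builtin": 1.0}
print(select_diverse(reference, candidates, quality, k=2))
```

Novelty is measured against the closest item already kept, so a candidate that duplicates an earlier pick scores low even if it differs from the reference.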
Annotate tradeoffs. For each variant, document what it gains and what it costs relative to the reference. Be explicit: "This variant uses O(1) lookup but O(n) memory" or "This approach is more readable but less performant for large inputs."
Present as a structured comparison. Deliver the variants in a format that enables direct comparison: side-by-side code blocks, a tradeoff table, or a decision matrix. Never present diverse answers as an undifferentiated list.
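The tradeoff table can be generated from the annotations collected in the previous step. The helper below is a small hypothetical sketch that renders a markdown table; the column names and cell values are placeholders.

```python
def tradeoff_table(columns: list[str], rows: dict[str, list[str]]) -> str:
    """Render variants as a markdown table with one row per variant."""
    header = "| Variant | " + " | ".join(columns) + " |"
    divider = "|" + "---|" * (len(columns) + 1)
    body = [f"| {name} | " + " | ".join(cells) + " |" for name, cells in rows.items()]
    return "\n".join([header, divider, *body])


print(tradeoff_table(
    ["Lookup", "Memory"],
    {
        "Reference": ["O(n)", "O(1)"],   # placeholder values for illustration
        "Hash index": ["O(1)", "O(n)"],
    },
))
```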
Example 1: Diverse implementations of a rate limiter
User: "Show me 3 different ways to implement a rate limiter in Python."
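Approach: vary the algorithm (token bucket, sliding window log, fixed window counter) while keeping the same `allow()` interface and the `time.monotonic()` clock.
Output: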
```python
# Variant A: Token Bucket (reference)
# Tradeoff: Smooth rate limiting, allows bursts up to bucket capacity
import time


class TokenBucketLimiter:
    def __init__(self, rate: float, capacity: int):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last_refill = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last_refill) * self.rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


# Variant B: Sliding Window Log
# Tradeoff: Exact counting (no bursts), but O(n) memory per window
from collections import deque
import time


class SlidingWindowLimiter:
    def __init__(self, max_requests: int, window_seconds: float):
        self.max_requests = max_requests
        self.window = window_seconds
        self.timestamps = deque()

    def allow(self) -> bool:
        now = time.monotonic()
        while self.timestamps and self.timestamps[0] <= now - self.window:
            self.timestamps.popleft()
        if len(self.timestamps) < self.max_requests:
            self.timestamps.append(now)
            return True
        return False


# Variant C: Fixed Window Counter
# Tradeoff: O(1) memory and time, but allows 2x burst at window boundaries
import time


class FixedWindowLimiter:
    def __init__(self, max_requests: int, window_seconds: float):
        self.max_requests = max_requests
        self.window = window_seconds
        self.count = 0
        self.window_start = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        if now - self.window_start >= self.window:
            self.count = 0
            self.window_start = now
        if self.count < self.max_requests:
            self.count += 1
            return True
        return False
```
| Variant | Burst Behavior | Memory | Precision |
|---------|----------------|--------|-----------|
| A: Token Bucket | Controlled bursts | O(1) | Approximate |
| B: Sliding Window | No bursts | O(n) | Exact |
| C: Fixed Window | 2x burst at edges | O(1) | Approximate |
Example 2: Diverse approaches to error handling in an API client
User: "What are different ways to handle errors in this HTTP client?"
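Approach: vary the error-handling strategy (exceptions, explicit result types, retry with a circuit breaker) around the same `requests.get` call.
Output: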
```python
# Variant A: Exception hierarchy (reference)
# Philosophy: Fail fast, let callers decide recovery
import requests


class ApiError(Exception):
    def __init__(self, status: int, message: str):
        super().__init__(f"HTTP {status}: {message}")  # so str(err) is informative
        self.status = status
        self.message = message


def fetch(url: str) -> dict:
    resp = requests.get(url, timeout=5)  # timeout added to match Variant C
    if not resp.ok:
        raise ApiError(resp.status_code, resp.text)
    return resp.json()


# Variant B: Result monad pattern
# Philosophy: Make errors explicit in return types, no hidden control flow
from dataclasses import dataclass
from typing import Generic, TypeVar, Union

import requests

T = TypeVar("T")


@dataclass
class Ok(Generic[T]):
    value: T


@dataclass
class Err:
    status: int
    message: str


# Union[...] rather than `Ok[T] | Err` so the alias also works before Python 3.10
Result = Union[Ok[T], Err]


def fetch(url: str) -> Result[dict]:
    resp = requests.get(url, timeout=5)
    if not resp.ok:
        return Err(resp.status_code, resp.text)
    return Ok(resp.json())


# Variant C: Retry with circuit breaker
# Philosophy: Resilience over simplicity, self-healing on transient failures
import time

import requests


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30):
        self.failures = 0
        self.threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.last_failure = 0.0
        self.open = False

    def fetch(self, url: str) -> dict:
        if self.open:
            # After the reset timeout, let one trial request through
            if time.monotonic() - self.last_failure > self.reset_timeout:
                self.open = False
            else:
                raise ConnectionError("Circuit open, backing off")
        try:
            resp = requests.get(url, timeout=5)
            resp.raise_for_status()
            self.failures = 0
            return resp.json()
        except requests.RequestException:
            self.failures += 1
            self.last_failure = time.monotonic()
            if self.failures >= self.threshold:
                self.open = True
            raise
```
Example 3: Diverse SQL query strategies
User: "Give me different ways to find the top 5 customers by total spend."
Approach: vary the query structure (direct aggregation, CTE, window function) over the same `orders` table.
Output:
```sql
-- Variant A: Direct aggregation (reference)
-- Tradeoff: Simplest; LIMIT works in PostgreSQL, MySQL, and SQLite; may scan full table
SELECT customer_id, SUM(amount) AS total_spend
FROM orders
GROUP BY customer_id
ORDER BY total_spend DESC
LIMIT 5;

-- Variant B: CTE for readability and reuse
-- Tradeoff: Clearer intent, can reference totals multiple times
WITH customer_totals AS (
    SELECT customer_id, SUM(amount) AS total_spend
    FROM orders
    GROUP BY customer_id
)
SELECT * FROM customer_totals
ORDER BY total_spend DESC
LIMIT 5;

-- Variant C: Window function (useful when you also need per-order detail)
-- Tradeoff: Returns order-level rows with rank, heavier but more flexible;
-- ties in total spend can return more than 5 customers
-- Totals come from an inner query because window functions cannot be nested
SELECT customer_id, amount, total_spend, rnk
FROM (
    SELECT customer_id, amount, total_spend,
           DENSE_RANK() OVER (ORDER BY total_spend DESC) AS rnk
    FROM (
        SELECT customer_id, amount,
               SUM(amount) OVER (PARTITION BY customer_id) AS total_spend
        FROM orders
    ) totals
) ranked
WHERE rnk <= 5;
```
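The bounded filter applies to SQL variants as well: different query shapes should agree on the answer. The sketch below checks this with Python's built-in `sqlite3` module against a small in-memory table. The sample data is made up, and the window-function query is written here in a form that computes per-customer totals in a subquery before ranking; it assumes an SQLite build with window-function support (3.25 or later).

```python
import sqlite3

QUERY_DIRECT = """
SELECT customer_id, SUM(amount) AS total_spend
FROM orders GROUP BY customer_id
ORDER BY total_spend DESC LIMIT 5
"""

QUERY_CTE = """
WITH customer_totals AS (
    SELECT customer_id, SUM(amount) AS total_spend
    FROM orders GROUP BY customer_id
)
SELECT * FROM customer_totals ORDER BY total_spend DESC LIMIT 5
"""

QUERY_WINDOW = """
SELECT DISTINCT customer_id, total_spend
FROM (
    SELECT customer_id, total_spend,
           DENSE_RANK() OVER (ORDER BY total_spend DESC) AS rnk
    FROM (
        SELECT customer_id,
               SUM(amount) OVER (PARTITION BY customer_id) AS total_spend
        FROM orders
    ) totals
) ranked
WHERE rnk <= 5
ORDER BY total_spend DESC
"""


def top_customers(conn: sqlite3.Connection, query: str) -> list[int]:
    """Run a query and return customer ids in result order."""
    return [row[0] for row in conn.execute(query)]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer_id INTEGER, amount REAL)")
# Made-up data: customers 1-7 with distinct totals, plus a large second order for customer 3.
sample = [(cid, float(cid * 10)) for cid in range(1, 8)] + [(3, 500.0)]
conn.executemany("INSERT INTO orders VALUES (?, ?)", sample)

results = {name: top_customers(conn, q) for name, q in
           [("direct", QUERY_DIRECT), ("cte", QUERY_CTE), ("window", QUERY_WINDOW)]}
print(results)
```

The sample data avoids ties on purpose: with tied totals, `DENSE_RANK` can return more than five customers while `LIMIT 5` cuts off arbitrarily, so the variants would legitimately disagree.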
Paper: DARL: Encouraging Diverse Answers for General Reasoning without Verifiers (Huang et al., 2026). Look for: the bounded diversity reward formula (Section 3), the gamma/beta hyperparameter ablation (Section 5.3), and the WritingBench results showing where diversity matters most (Table 3).