.github/skills/edit-llm-inference-style/SKILL.md
Guide for adjusting speedy_utils LLM inference style, including chat templating, <think> prefixes, stop sequences, and boxed-answer handling.
Install this skill with one command (works with Claude Code, Cursor, and Windsurf):
npx skillsauth add anhvth/speedy_utils edit-llm-inference-style
Use this skill to standardize or modify how speedy_utils builds prompts and consumes generation outputs. It focuses on chat templating, reasoning-style prefixes, and safe stopping rules for structured answer extraction.
Use this skill when you need to:
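- Add or change a generation prefix (for example, <think>\n).
- Call LLM.generate() instead of chat completion helpers.
- Stop generation at boxed answers (\boxed{}) or <|im_end|> tokens.

You will need:
- A tokenizer that loads with transformers.AutoTokenizer.
- An LLM instance configured to point at the correct backend.

Use the tokenizer to format messages and append a generation prefix: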
from transformers import AutoTokenizer

TOKENIZER_NAME = "Qwen/Qwen3-4B"
THINK_PREFIX = "<think>\n"

messages = [
    {"role": "system", "content": instruction},
    {"role": "user", "content": user_text},
]

tokenizer = AutoTokenizer.from_pretrained(TOKENIZER_NAME, trust_remote_code=True)
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
prompt = f"{prompt}{THINK_PREFIX}"
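
The templating step can be wrapped in a small helper. The sketch below is one possible shape for a `format_prompt(question, tokenizer)` function; the `instruction` default, the `think_prefix` parameter, and the guard against a doubled prefix are illustrative assumptions, not part of speedy_utils. It only relies on the tokenizer exposing `apply_chat_template`, as Hugging Face tokenizers do.

```python
from typing import Any

THINK_PREFIX = "<think>\n"


def format_prompt(
    question: str,
    tokenizer: Any,
    instruction: str = "You are a helpful assistant.",  # assumed default, replace as needed
    think_prefix: str = THINK_PREFIX,
) -> str:
    """Render a chat prompt and append the generation prefix exactly once."""
    messages = [
        {"role": "system", "content": instruction},
        {"role": "user", "content": question},
    ]
    prompt = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
    )
    # Guard for templates whose generation prompt already ends with the prefix,
    # so the prefix is never doubled.
    if not prompt.endswith(think_prefix):
        prompt += think_prefix
    return prompt
```

Keeping the prefix logic in one function means every call site produces the same prompt ending, which is the main point of standardizing the prefix.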
LLM.generate() routes to /inference/v1/generate, which works directly with token IDs or raw text:

result = llm.generate(
    prompt,
    max_tokens=512,
    temperature=temperature,
    stop=["<|im_end|>"],
)
text = result["text"]
Use a post-processing step to truncate output when a boxed answer or <|im_end|> appears:

import re

# Note: the non-greedy match stops at the first "}", so nested braces
# such as \boxed{\frac{1}{2}} are cut short.
BOXED_PATTERN = re.compile(r"\\boxed\{.*?\}", re.DOTALL)


def truncate_completion(text: str) -> str:
    """Cut the text at <|im_end|> or right after the first boxed answer, whichever comes first."""
    end_positions = []
    im_end_idx = text.find("<|im_end|>")
    if im_end_idx != -1:
        end_positions.append(im_end_idx)
    boxed_match = BOXED_PATTERN.search(text)
    if boxed_match:
        end_positions.append(boxed_match.end())
    if not end_positions:
        return text
    return text[: min(end_positions)]
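
The regex above ends the match at the first closing brace, so an answer like `\boxed{\frac{1}{2}}` is truncated inside the fraction. A minimal sketch of a brace-counting alternative follows; `find_boxed_span` and `truncate_completion_balanced` are hypothetical names introduced here for illustration, and the scan does not treat escaped braces (`\{`, `\}`) specially.

```python
from typing import Optional, Tuple

BOXED_OPEN = "\\boxed{"


def find_boxed_span(text: str) -> Optional[Tuple[int, int]]:
    """Return (start, end) of the first \\boxed{...} with balanced braces, or None."""
    start = text.find(BOXED_OPEN)
    if start == -1:
        return None
    depth = 0
    # Begin scanning at the opening brace that follows "\boxed".
    for i in range(start + len(BOXED_OPEN) - 1, len(text)):
        ch = text[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return start, i + 1
    return None  # braces never balanced (e.g. generation was cut off)


def truncate_completion_balanced(text: str) -> str:
    """Cut at <|im_end|> or after the first balanced boxed answer, whichever comes first."""
    end_positions = []
    im_end_idx = text.find("<|im_end|>")
    if im_end_idx != -1:
        end_positions.append(im_end_idx)
    span = find_boxed_span(text)
    if span is not None:
        end_positions.append(span[1])
    return text[: min(end_positions)] if end_positions else text
```

For flat answers such as `\boxed{42}` both versions behave the same; the counting version only differs when the boxed content itself contains braces.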
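When extracting the final number, prefer the boxed span when available:

import re

BOXED_PATTERN = re.compile(r"\\boxed\{.*?\}", re.DOTALL)


def extract_final_number(text: str) -> str:
    """Return the last number in the boxed span if present, else the last number in the text."""
    boxed_match = BOXED_PATTERN.search(text)
    if boxed_match:
        text = boxed_match.group(0)
    # Require digits after the decimal point so a sentence-ending period
    # ("... is 42.") is not captured as part of the number.
    nums = re.findall(r"-?\d+(?:\.\d+)?", text)
    return nums[-1] if nums else ""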
- Keep the generation prefix (<think>\n) consistent to reduce formatting drift.
- Use LLM.generate() for low-level control of stop sequences and token handling.

Putting the steps together:

# format_prompt is assumed to wrap the chat-template step shown earlier.
prompt = format_prompt(question, tokenizer)
raw_output = llm.generate(
    prompt,
    max_tokens=512,
    temperature=temperature,
    stop=["<|im_end|>"],
)["text"]
raw_output = truncate_completion(raw_output)
pred = extract_final_number(raw_output)
- src/llm_utils/lm/mixins.py: LLM.generate() implementation.
- src/llm_utils/chat_format/transform.py: Chat templating utility.
- docs/GENERATE_QUICKREF.md: generate() parameters and response format.