Adoption

Agent Skills are supported by leading AI development tools.

VS Code, Gemini CLI, GitHub, Goose, Amp, Cursor, Claude Code, Letta, OpenCode, Claude, OpenAI Codex, Factory

anhvth/edit-llm-inference-style

Name: edit-llm-inference-style
Author: anhvth

.github/skills/edit-llm-inference-style/SKILL.md

npx skillsauth add anhvth/speedy_utils edit-llm-inference-style

Edit LLM Inference Style (speedy_utils)

Use this skill to standardize or modify how speedy_utils builds prompts and consumes generation outputs. It focuses on chat templating, reasoning-style prefixes, and safe stopping rules for structured answer extraction.

When to Use This Skill

Use this skill when you need to:

Insert or enforce a reasoning prefix (for example, <think>\n).
Switch a flow to LLM.generate() instead of chat completion helpers.
Apply a tokenizer chat template before generating.
Stop generations on boxed answers (\boxed{}) or <|im_end|> tokens.
Normalize outputs before evaluation (for example, GSM8K or math tasks).

Prerequisites

A model-backed tokenizer available via transformers.AutoTokenizer.
A speedy_utils LLM instance configured to point at the correct backend.

Core Capabilities

1) Build a Chat-Templated Prompt

Use the tokenizer to format messages and append a generation prefix:

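from transformers import AutoTokenizer

TOKENIZER_NAME = "Qwen/Qwen3-4B"
THINK_PREFIX = "<think>\n"

# Placeholder inputs; replace with your task's instruction and question.
instruction = "Solve the problem and put the final answer in \\boxed{}."
user_text = "What is 12 * 7?"

messages = [
    {"role": "system", "content": instruction},
    {"role": "user", "content": user_text},
]

tokenizer = AutoTokenizer.from_pretrained(TOKENIZER_NAME, trust_remote_code=True)
# tokenize=False returns the formatted prompt as a string rather than token IDs;
# add_generation_prompt=True appends the assistant-turn header.
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
# Append the reasoning prefix so the completion starts inside a <think> block.
prompt = f"{prompt}{THINK_PREFIX}"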
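Some chat templates may already end the generation prompt with a reasoning prefix, in which case appending it again would duplicate the tag. The following is a minimal sketch of a guard against that; `ensure_think_prefix` is a hypothetical helper for illustration, not part of speedy_utils.

```python
THINK_PREFIX = "<think>\n"


def ensure_think_prefix(prompt: str, prefix: str = THINK_PREFIX) -> str:
    """Append the reasoning prefix unless the prompt already ends with it.

    Hypothetical helper, not a speedy_utils function.
    """
    # Compare without trailing whitespace so "<think>" and "<think>\n" both count.
    if prompt.rstrip().endswith(prefix.rstrip()):
        return prompt
    return f"{prompt}{prefix}"
```

Calling it on a prompt that ends with the assistant header appends the prefix once; calling it again on the result leaves the prompt unchanged.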

2) Generate with Low-Level `LLM.generate`

LLM.generate() routes to /inference/v1/generate, which works directly with token IDs or raw text.

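# `llm` is the configured speedy_utils LLM instance from the prerequisites.
temperature = 0.0  # example value; tune per task

result = llm.generate(
    prompt,
    max_tokens=512,
    temperature=temperature,
    stop=["<|im_end|>"],
)
text = result["text"]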
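If several scripts repeat this call, a thin wrapper can keep the sampling settings in one place. The sketch below assumes only what the snippet above shows: that `generate()` accepts `max_tokens`, `temperature`, and `stop`, and returns a mapping with a `"text"` key. The names `SupportsGenerate` and `generate_text` are hypothetical and not part of speedy_utils.

```python
from typing import Any, Protocol


class SupportsGenerate(Protocol):
    """Anything with a generate() method shaped like the call shown above."""

    def generate(self, prompt: str, **kwargs: Any) -> dict: ...


def generate_text(
    llm: SupportsGenerate,
    prompt: str,
    *,
    temperature: float = 0.0,
    max_tokens: int = 512,
    stop: tuple[str, ...] = ("<|im_end|>",),
) -> str:
    """Call llm.generate() and return the completion text.

    Assumes the response is a dict with a "text" key, as in the snippet above.
    """
    result = llm.generate(
        prompt,
        max_tokens=max_tokens,
        temperature=temperature,
        stop=list(stop),
    )
    text = result.get("text")
    if not isinstance(text, str):
        raise ValueError("generate() response has no string 'text' field")
    return text
```

Failing loudly on a missing `"text"` field makes a backend or response-format mismatch visible at the call site instead of later in parsing.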

3) Stop on Boxed Answers or `<|im_end|>`

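Use a post-processing step to truncate output at the first <|im_end|> token or just after the first boxed answer, whichever comes first:

import re

# Non-greedy: the match ends at the first "}", so nested braces such as
# \boxed{\frac{1}{2}} are cut short.
BOXED_PATTERN = re.compile(r"\\boxed\{.*?\}", re.DOTALL)


def truncate_completion(text: str) -> str:
    """Cut text at the first <|im_end|> or just after the first boxed answer."""
    end_positions = []
    im_end_idx = text.find("<|im_end|>")
    if im_end_idx != -1:
        end_positions.append(im_end_idx)  # drop the token itself
    boxed_match = BOXED_PATTERN.search(text)
    if boxed_match:
        end_positions.append(boxed_match.end())  # keep the boxed span
    if not end_positions:
        return text
    return text[: min(end_positions)]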
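The non-greedy pattern above stops at the first closing brace, so an answer such as `\boxed{\frac{1}{2}}` would be truncated mid-expression. One possible alternative, sketched here under the assumption that braces inside the boxed answer are balanced, is to count brace depth instead of using a regex. `find_boxed_end` and `truncate_at_boxed` are hypothetical names, not speedy_utils functions.

```python
def find_boxed_end(text: str) -> int:
    """Return the index just past the first balanced \\boxed{...}, or -1.

    Hypothetical helper; does not treat escaped braces (\\{, \\}) specially.
    """
    start = text.find("\\boxed{")
    if start == -1:
        return -1
    depth = 0
    # Begin scanning at the opening brace that follows "\boxed".
    for i in range(start + len("\\boxed"), len(text)):
        ch = text[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return i + 1
    return -1  # braces never balanced (for example, a cut-off generation)


def truncate_at_boxed(text: str) -> str:
    """Keep everything up to and including the first balanced boxed answer."""
    end = find_boxed_end(text)
    return text if end == -1 else text[:end]
```

An unbalanced span returns the text unchanged, which leaves the decision about incomplete generations to the caller.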

4) Extract a Final Number Safely

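Prefer the boxed span when available; otherwise fall back to the last number in the text:

import re

BOXED_PATTERN = re.compile(r"\\boxed\{.*?\}", re.DOTALL)


def extract_final_number(text: str) -> str:
    """Return the last number in the boxed span, or in the whole text if none."""
    boxed_match = BOXED_PATTERN.search(text)
    if boxed_match:
        text = boxed_match.group(0)
    # Require digits after the decimal point so a sentence-ending period
    # ("... is 42.") is not captured as part of the number.
    nums = re.findall(r"-?\d+(?:\.\d+)?", text)
    return nums[-1] if nums else ""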
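The extracted value is a string, so comparing it directly with a reference answer can fail on formatting alone (for example, "42.0" versus "42", or a reference written with thousands separators). A minimal sketch of a numeric comparison for the eval script follows; `to_decimal` and `numbers_match` are hypothetical helpers, and stripping commas and dollar signs is an assumption about the answer format.

```python
from decimal import Decimal, InvalidOperation
from typing import Optional


def to_decimal(num_str: str) -> Optional[Decimal]:
    """Parse a numeric string, ignoring commas and dollar signs (assumed noise)."""
    cleaned = num_str.replace(",", "").replace("$", "").strip()
    try:
        return Decimal(cleaned)
    except InvalidOperation:
        return None  # empty or non-numeric input


def numbers_match(pred: str, gold: str) -> bool:
    """True only if both strings parse and are numerically equal."""
    p, g = to_decimal(pred), to_decimal(gold)
    return p is not None and g is not None and p == g
```

Using `Decimal` instead of `float` avoids binary rounding surprises when answers carry decimal places.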

Guidelines

Always apply the tokenizer chat template when the model expects chat-formatted input.
Use a consistent reasoning prefix (<think>\n) to reduce formatting drift.
Prefer LLM.generate() for low-level control of stop sequences and token handling.
Post-truncate output to avoid extra text after boxed answers or special tokens.
Keep evaluation logic (parsing + correctness) in the eval script, not in the LLM client.

Common Patterns

Pattern: GSM8K-style Generation

prompt = format_prompt(question, tokenizer)
raw_output = llm.generate(
    prompt,
    max_tokens=512,
    temperature=temperature,
    stop=["<|im_end|>"],
)["text"]
raw_output = truncate_completion(raw_output)
pred = extract_final_number(raw_output)
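The pattern above calls `format_prompt`, which this skill does not define. One way it might look, combining the chat-template and reasoning-prefix steps from section 1, is sketched below. The default instruction text is a placeholder, and the function only assumes the tokenizer exposes `apply_chat_template` as `transformers` tokenizers do.

```python
from typing import Any

THINK_PREFIX = "<think>\n"
# Placeholder instruction for illustration; not taken from speedy_utils.
DEFAULT_INSTRUCTION = "Solve the problem. Put the final answer in \\boxed{}."


def format_prompt(
    question: str,
    tokenizer: Any,
    instruction: str = DEFAULT_INSTRUCTION,
) -> str:
    """Build a chat-templated prompt that ends with the reasoning prefix."""
    messages = [
        {"role": "system", "content": instruction},
        {"role": "user", "content": question},
    ]
    prompt = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
    )
    return f"{prompt}{THINK_PREFIX}"
```

Keeping prompt construction in one function means the generation loop and any evaluation script format questions identically.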

Related Files

src/llm_utils/lm/mixins.py: LLM.generate() implementation.
src/llm_utils/chat_format/transform.py: Chat templating utility.
docs/GENERATE_QUICKREF.md: generate() parameters and response format.

Guide for adjusting speedy_utils LLM inference style, including chat templating, <think> prefixes, stop sequences, and boxed-answer handling.

Category: documentation, 7 stars, updated Apr 15, 2026

Install this skill globally with one command (works with Claude Code, Cursor, and Windsurf):

npx skillsauth add anhvth/speedy_utils edit-llm-inference-style

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

Clean: Trivy, container and dependency vulnerability scanner (95%)
Clean: Semgrep, static code analysis for vulnerabilities (95%)
Clean: mcp-scan (Snyk), Model Context Protocol security validation (95%)
Skipped: Snyk (dep), open source security scanning (50%)
Skipped: Socket.dev, supply chain security analysis (50%)
Skipped: VirusTotal, multi-engine malware detection (50%)
Skipped: CrowdStrike, advanced threat intelligence (50%)
Skipped: OSV-Scanner, Open Source Vulnerability database check (50%)
Skipped: OWASP Dep-Check (50%)

Last scanned: Apr 15, 2026, 6:08 PM; 4.2s; 1 file scanned

Related Skills

anhvth/vision-utilities (documentation, 7 stars, updated Apr 15, 2026)
Guide for using vision utilities in speedy_utils, including fast GPU image loading, memory-mapped datasets, and notebook visualization.

anhvth/skill-creation (development, 7 stars, updated Apr 15, 2026)
Guide for creating new Agent Skills with proper structure, frontmatter, bundled assets, and validation. Includes templates, best practices, and examples for building reusable skill resources.

anhvth/.github/skills/ray-distributed-computing (documentation, 7 stars, updated Apr 15, 2026)
Comprehensive guide to using Ray for scalable distributed computing, including Ray Core, Data, Train, Tune, Serve, and RLlib with practical examples.

anhvth/multi-threading-processing (development, 7 stars, updated Apr 15, 2026)
Comprehensive guide for using multi-threading and multi-processing in Python, including when to choose each approach, best practices, and practical examples using the speedy_utils library.

Install manually

# Clone the repo
git clone https://github.com/anhvth/speedy_utils.git

# Copy into the Claude Code skills folder (global); create the folder first if it is missing
mkdir -p ~/.claude/skills
cp -r speedy_utils/.github/skills/edit-llm-inference-style ~/.claude/skills/

Repository: anhvth/speedy_utils (7 stars)

Compatible with: Claude Code, OpenAI Codex CLI, ChatGPT
