dist/plugins/api-ai-huggingface-inference/skills/api-ai-huggingface-inference/SKILL.md
Hugging Face Inference SDK patterns for TypeScript/Node.js — InferenceClient setup, chat completion, text generation, streaming, embeddings, image generation, audio transcription, translation, summarization, and Inference Endpoints
npx skillsauth add agents-inc/skills api-ai-huggingface-inferenceInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Quick Guide: Use
@huggingface/inference(v4+) to access 200k+ ML models on the Hugging Face Hub. UseInferenceClientwithchatCompletion()for OpenAI-compatible chat,textGeneration()for raw text completion,chatCompletionStream()for streaming,featureExtraction()for embeddings,textToImage()for image generation, andautomaticSpeechRecognition()for audio transcription. Setproviderto route through inference providers (Cerebras, Together, Groq, etc.) or useendpointUrlfor dedicated Inference Endpoints.
<critical_requirements>
All code must follow project conventions in CLAUDE.md (kebab-case, named exports, import ordering,
import type, named constants)
(You MUST always pass an access token to InferenceClient -- never deploy without authentication)
(You MUST use chatCompletion() / chatCompletionStream() for conversational LLM tasks -- these follow the OpenAI-compatible message format)
(You MUST handle errors using InferenceClientError and its subclasses -- never use bare catch blocks without error type checking)
(You MUST specify a model parameter for every inference call -- there is no default model)
(You MUST never hardcode access tokens -- always use environment variables via process.env.HF_TOKEN)
</critical_requirements>
Auto-detection: Hugging Face, huggingface, @huggingface/inference, InferenceClient, HfInference, hf.chatCompletion, hf.textGeneration, hf.featureExtraction, hf.textToImage, hf.automaticSpeechRecognition, hf.translation, hf.summarization, hf.textToSpeech, chatCompletionStream, textGenerationStream, HF_TOKEN, inference provider, Inference Endpoints
When to use:
Key patterns covered:
When NOT to use:
@huggingface/hub package or Python transformersThe @huggingface/inference SDK provides a unified TypeScript client for accessing hundreds of thousands of ML models through multiple backends: serverless Inference Providers, dedicated Inference Endpoints, and local servers.
Core principles:
model parameter without code changes.provider parameter, or deploy your own Inference Endpoints.chatCompletion, textToImage, automaticSpeechRecognition), not raw HTTP endpoints.chatCompletion() uses the OpenAI message format (role + content), making migration between providers easy.chatCompletionStream() and textGenerationStream() return AsyncGenerator, consumed with for await...of.When to use Hugging Face Inference:
When NOT to use:
Initialize with your Hugging Face access token. The token is required for authenticated access.
// lib/hf-client.ts -- basic setup
import { InferenceClient } from "@huggingface/inference";
const client = new InferenceClient(process.env.HF_TOKEN);
export { client };
// lib/hf-client.ts -- with custom endpoint
const ENDPOINT_URL =
"https://your-endpoint.us-east-1.aws.endpoints.huggingface.cloud/v1/";
const client = new InferenceClient(process.env.HF_TOKEN, {
endpointUrl: ENDPOINT_URL,
});
export { client };
Why good: Token from env var, named constant for endpoint URL, named export
// BAD: Hardcoded token, no named export
const hf = new InferenceClient("hf_abc123xyz");
export default hf;
Why bad: Hardcoded token is a security risk, default export violates conventions
See: examples/core.md for provider routing, local endpoints, and endpoint helper
Use chatCompletion() for conversational LLM tasks. Follows the OpenAI message format.
const MAX_TOKENS = 512;
const TEMPERATURE = 0.1;
const response = await client.chatCompletion({
model: "Qwen/Qwen3-32B",
provider: "cerebras",
messages: [
{ role: "system", content: "You are a helpful coding assistant." },
{ role: "user", content: "Explain TypeScript generics." },
],
max_tokens: MAX_TOKENS,
temperature: TEMPERATURE,
});
console.log(response.choices[0].message.content);
Why good: Named constants for parameters, explicit model and provider, system message for behavior
// BAD: No model specified, magic numbers, no system message
const response = await client.chatCompletion({
messages: [{ role: "user", content: "do something" }],
max_tokens: 512,
temperature: 0.1,
});
Why bad: Missing required model, magic numbers, vague prompt, no system instruction
See: examples/core.md for multi-turn conversations and provider selection
Use chatCompletionStream() for streaming responses. Returns an AsyncGenerator.
const MAX_TOKENS = 512;
let fullResponse = "";
for await (const chunk of client.chatCompletionStream({
model: "Qwen/Qwen3-32B",
provider: "cerebras",
messages: [{ role: "user", content: "Explain async/await in TypeScript." }],
max_tokens: MAX_TOKENS,
})) {
if (chunk.choices && chunk.choices.length > 0) {
const content = chunk.choices[0].delta.content;
if (content) {
process.stdout.write(content);
fullResponse += content;
}
}
}
console.log(); // newline
Why good: Async generator consumed with for await, progressive output, null checks on chunk data
// BAD: Not checking chunk.choices, ignoring null content
for await (const chunk of client.chatCompletionStream({
model: "...",
messages: [],
})) {
process.stdout.write(chunk.choices[0].delta.content); // May throw on null
}
Why bad: No null check -- choices may be empty, content may be null between chunks
See: examples/core.md for text generation streaming
Use textGeneration() for prompt continuation without the chat message format.
const MAX_NEW_TOKENS = 250;
const result = await client.textGeneration({
model: "mistralai/Mixtral-8x7B-v0.1",
provider: "together",
inputs: "The key benefits of TypeScript are",
parameters: { max_new_tokens: MAX_NEW_TOKENS },
});
console.log(result.generated_text);
Why good: Named constant, clear prompt, explicit provider, direct access to generated_text
See: examples/core.md for streaming text generation
Use featureExtraction() for generating vector embeddings for semantic search and RAG.
const embeddings = await client.featureExtraction({
model: "sentence-transformers/all-MiniLM-L6-v2",
inputs: "That is a happy person",
});
// Returns: number[] (embedding vector)
Why good: Purpose-built embedding model, simple input/output
See: examples/tasks.md for batch embeddings and cosine similarity
Use textToImage() to generate images from text prompts. Returns a Blob.
const imageBlob = await client.textToImage({
model: "black-forest-labs/FLUX.1-dev",
inputs: "a serene mountain landscape at sunset",
provider: "replicate",
});
// imageBlob is a Blob -- write to file or convert to buffer
Why good: Explicit model and provider, descriptive prompt
See: examples/tasks.md for saving images, image-to-image, and output formats
Use automaticSpeechRecognition() for speech-to-text.
import { readFileSync } from "node:fs";
const result = await client.automaticSpeechRecognition({
model: "facebook/wav2vec2-large-960h-lv60-self",
data: readFileSync("audio/recording.flac"),
});
console.log(result.text);
Why good: Uses data parameter with file buffer, outputs .text
See: examples/tasks.md for Whisper models, audio classification, and text-to-speech
Always catch InferenceClientError and its subclasses. Re-throw unexpected errors.
import {
InferenceClientError,
InferenceClientInputError,
InferenceClientProviderApiError,
InferenceClientProviderOutputError,
InferenceClientHubApiError,
} from "@huggingface/inference";
try {
const result = await client.chatCompletion({
model: "Qwen/Qwen3-32B",
messages: [{ role: "user", content: "Hello" }],
});
} catch (error) {
if (error instanceof InferenceClientProviderApiError) {
console.error("Provider API error:", error.message);
console.error("Request:", error.request);
console.error("Response:", error.response);
} else if (error instanceof InferenceClientHubApiError) {
console.error("Hub API error:", error.message);
} else if (error instanceof InferenceClientProviderOutputError) {
console.error("Malformed provider response:", error.message);
} else if (error instanceof InferenceClientInputError) {
console.error("Invalid input:", error.message);
} else if (error instanceof InferenceClientError) {
console.error("Inference error:", error.message);
} else {
throw error; // Re-throw non-inference errors
}
}
Why good: Specific error types for each failure mode, request/response details for debugging, re-throws unexpected errors
See: examples/core.md for full error handling patterns
</patterns><decision_framework>
What is your task?
+-- Conversational LLM (messages) -> chatCompletion() / chatCompletionStream()
+-- Raw text continuation -> textGeneration() / textGenerationStream()
+-- Embeddings for search/RAG -> featureExtraction()
+-- Image from text prompt -> textToImage()
+-- Speech to text -> automaticSpeechRecognition()
+-- Text to speech -> textToSpeech()
+-- Language translation -> translation()
+-- Summarize long text -> summarization()
+-- Classify text -> textClassification()
+-- Named entity recognition -> tokenClassification()
+-- Classify image -> imageClassification()
+-- Detect objects -> objectDetection()
+-- Caption an image -> imageToText()
+-- Answer questions from context -> questionAnswering()
Do you have a conversation with roles (system/user/assistant)?
+-- YES -> chatCompletion() / chatCompletionStream()
| Uses OpenAI-compatible message format
| Supports system messages, multi-turn
+-- NO -> Do you want to continue/complete a text prompt?
+-- YES -> textGeneration() / textGenerationStream()
| Takes raw text input via 'inputs'
+-- NO -> Use a task-specific method instead
What are your deployment needs?
+-- Prototyping / low volume -> Serverless Inference Providers (provider: "auto")
| Free tier available, shared infrastructure, may have cold starts
+-- Production / high volume -> Inference Endpoints (endpointUrl)
| Dedicated GPU, autoscaling, scale-to-zero, private infrastructure
+-- Local development -> Local endpoint (endpointUrl: "http://localhost:8080")
Works with llama.cpp, Ollama, vLLM, TGI, LiteLLM
Do you need access to 200k+ open-source models?
+-- YES -> Use @huggingface/inference
+-- NO -> Do you only use OpenAI models?
+-- YES -> Not this skill's scope -- use the OpenAI SDK directly
+-- NO -> Do you need structured outputs / tool calling?
+-- YES -> Not this skill's scope -- use a higher-level AI SDK
+-- NO -> @huggingface/inference works for most ML tasks
</decision_framework>
<red_flags>
High Priority Issues:
catch blocks without checking InferenceClientError types (hides API errors, loses debug info)model parameter -- every inference call requires an explicit modelchatCompletionStream() / textGenerationStream() generators (tokens are silently lost)textGeneration() for conversational tasks instead of chatCompletion() (wrong API shape)Medium Priority Issues:
provider when a specific provider is needed (default "auto" picks based on your HF settings, which may not be optimal)chunk.choices length before accessing chunk.choices[0] in streaming (may throw on empty chunks)request() / streamingRequest() directly -- these are deprecated, use task-specific methodsmax_tokens / max_new_tokens limits (output may be truncated or excessively long)Common Mistakes:
chatCompletion() parameters with textGeneration() parameters -- chat uses messages + max_tokens, text generation uses inputs + parameters.max_new_tokensinputs parameter with chatCompletion() -- it uses messages, not inputsmessages parameter with textGeneration() -- it uses inputs, not messagestextToImage() returns a Blob, not a URL or BufferfeatureExtraction() output as always a flat array -- shape depends on the model (can be nested arrays for batch inputs)data (binary) for audio/image tasks, or passing a string path instead of the actual file bufferGotchas & Edge Cases:
provider: "auto" default selects providers based on your HF account settings at hf.co/settings/inference-providers -- not by availability or speed. Set an explicit provider for predictable routing.endpointUrl, the model parameter is often ignored because the endpoint serves a specific model.chatCompletion() is OpenAI-API compatible -- it works with any OpenAI-compatible endpoint, not just Hugging Face.HfInference is still exported for backward compatibility but InferenceClient is the current class name.accessToken -- when authenticated with a non-HF key, requests go directly to the provider instead of through HF's routing layer.import { textGeneration } from "@huggingface/inference") require passing accessToken as a parameter instead of constructor.textToImage() supports multiple output types: blob (default), url, dataUrl, or json via the options outputType parameter.parameters.src_lang and parameters.tgt_lang for many-to-many models like mbart-large-50-many-to-many-mmt.</red_flags>
<critical_reminders>
All code must follow project conventions in CLAUDE.md (kebab-case, named exports, import ordering,
import type, named constants)
(You MUST always pass an access token to InferenceClient -- never deploy without authentication)
(You MUST use chatCompletion() / chatCompletionStream() for conversational LLM tasks -- these follow the OpenAI-compatible message format)
(You MUST handle errors using InferenceClientError and its subclasses -- never use bare catch blocks without error type checking)
(You MUST specify a model parameter for every inference call -- there is no default model)
(You MUST never hardcode access tokens -- always use environment variables via process.env.HF_TOKEN)
Failure to follow these rules will produce insecure, unreliable, or silently failing AI integrations.
</critical_reminders>
development
Material Design component library for Vue 3
development
VitePress 1.x — Vue-powered static site generator for documentation sites, built on Vite
tools
Docusaurus 3.x documentation framework — site configuration, docs/blog plugins, sidebars, versioning, MDX, swizzling, and deployment
development
TanStack Form patterns - useForm, form.Field, validators, arrays, linked fields, createFormHook, type safety