.claude/skills/ts-cerebras/SKILL.md
Expert guidance for Cerebras Inference, the ultra-fast LLM inference service powered by the world's largest chip (Wafer-Scale Engine). Helps developers integrate Cerebras' API for applications requiring the fastest possible token generation — real-time chat, code completion, and interactive AI experiences.
```typescript
// src/llm/cerebras.ts: Cerebras API (OpenAI-compatible)
import OpenAI from "openai";

const cerebras = new OpenAI({
  apiKey: process.env.CEREBRAS_API_KEY!,
  baseURL: "https://api.cerebras.ai/v1",
});
```
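The `!` on `process.env.CEREBRAS_API_KEY!` only silences the compiler. If the variable is unset, the problem surfaces later as an authentication error on the first request. The sketch below fails fast at startup instead. `requireEnv` is a hypothetical helper, not part of the OpenAI SDK.

```typescript
// Hypothetical helper (not part of the OpenAI SDK): return a required env var or throw.
// The env object is passed in explicitly so the function is easy to test.
function requireEnv(name: string, env: Record<string, string | undefined>): string {
  const value = env[name];
  if (value === undefined || value.trim() === "") {
    throw new Error(`Missing required environment variable: ${name}`);
  }
  return value;
}
```

Usage: `apiKey: requireEnv("CEREBRAS_API_KEY", process.env)` in place of the non-null assertion.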
```typescript
// Basic completion: roughly 2000+ output tokens/second on llama3.3-70b (approximate)
async function chat(prompt: string) {
  const response = await cerebras.chat.completions.create({
    model: "llama3.3-70b", // Llama 3.3 70B on Cerebras hardware
    messages: [
      { role: "system", content: "You are a helpful coding assistant." },
      { role: "user", content: prompt },
    ],
    temperature: 0.7,
    max_tokens: 1024,
    top_p: 1,
  });

  // Cerebras responses include a non-standard `time_info` object (times in seconds)
  // next to `usage`; it is not in the OpenAI SDK types, hence the cast.
  const completionTokens = response.usage?.completion_tokens;
  const completionTime = (response as any).time_info?.completion_time;
  if (completionTokens && completionTime) {
    console.log(`Tokens/sec: ${(completionTokens / completionTime).toFixed(0)}`);
  }
  return response.choices[0].message.content;
}
```
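Throughput is completion tokens divided by generation time. The calculation is easy to get wrong when either field is missing or zero. The minimal sketch below keeps it in a pure function. It assumes Cerebras reports generation time in seconds under a `time_info.completion_time` field; verify this against the current API reference. The `UsageLike` and `TimeInfoLike` shapes are hand-written assumptions, not SDK types.

```typescript
// Hand-written shapes (assumptions): only the fields this helper reads.
interface UsageLike {
  completion_tokens?: number;
}
interface TimeInfoLike {
  completion_time?: number; // seconds spent generating the completion
}

// Returns output tokens per second, or null when either input is missing or non-positive.
function tokensPerSecond(
  usage: UsageLike | undefined,
  timeInfo: TimeInfoLike | undefined,
): number | null {
  const tokens = usage?.completion_tokens;
  const seconds = timeInfo?.completion_time;
  if (!tokens || !seconds || seconds <= 0) return null;
  return tokens / seconds;
}
```

Usage: `tokensPerSecond(response.usage, (response as any).time_info)`. Returning `null` rather than `NaN` makes "no timing data" explicit to callers.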
```typescript
// Streaming: text arrives as it is generated; first token often in under ~200 ms (network-dependent)
async function streamChat(prompt: string, onChunk: (text: string) => void) {
  const stream = await cerebras.chat.completions.create({
    model: "llama3.3-70b",
    messages: [{ role: "user", content: prompt }],
    stream: true,
  });

  let full = "";
  for await (const chunk of stream) {
    const text = chunk.choices[0]?.delta?.content ?? "";
    full += text;
    onChunk(text);
  }
  return full;
}
```
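Latency figures are best checked from the client side, because client-side timing includes network overhead. The sketch below wraps any async iterable of OpenAI-style stream chunks and records the time to the first non-empty token and the total duration. `measureStream`, `ChunkLike` and `StreamStats` are hypothetical names for illustration. The chunk shape covers only the fields read here.

```typescript
// Minimal shape of an OpenAI-style streaming chunk (assumption: only the fields read here).
interface ChunkLike {
  choices: Array<{ delta?: { content?: string | null } }>;
}

interface StreamStats {
  text: string;
  chunks: number;
  firstTokenMs: number | null; // ms from startedAt to the first non-empty delta
  totalMs: number; // ms from startedAt to the end of the stream
}

// Consumes a stream, forwards non-empty text deltas to onChunk, and records timings.
// Pass startedAt (captured before calling create()) to include request latency.
async function measureStream(
  stream: AsyncIterable<ChunkLike>,
  onChunk: (text: string) => void = () => {},
  startedAt: number = Date.now(),
): Promise<StreamStats> {
  let firstTokenMs: number | null = null;
  let text = "";
  let chunks = 0;
  for await (const chunk of stream) {
    chunks++;
    const delta = chunk.choices[0]?.delta?.content ?? "";
    if (!delta) continue; // role-only or empty chunks carry no text
    if (firstTokenMs === null) firstTokenMs = Date.now() - startedAt;
    text += delta;
    onChunk(delta);
  }
  return { text, chunks, firstTokenMs, totalMs: Date.now() - startedAt };
}
```

Usage: capture `const t0 = Date.now()`, call `cerebras.chat.completions.create({ ..., stream: true })`, then `await measureStream(stream, onChunk, t0)`. Without `t0`, the clock starts only after the response headers arrive.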
```typescript
// JSON mode
async function structuredOutput(prompt: string) {
  const response = await cerebras.chat.completions.create({
    model: "llama3.3-70b",
    messages: [{ role: "user", content: prompt }],
    response_format: { type: "json_object" },
    temperature: 0,
  });
  return JSON.parse(response.choices[0].message.content!);
}
```
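`JSON.parse` throws on malformed input, and `message.content` is typed `string | null`. A response cut off at `max_tokens` (`finish_reason: "length"`) can also contain truncated, invalid JSON. The defensive variant below returns a result object instead of throwing. `parseJsonObject` is a hypothetical helper, not part of the SDK.

```typescript
type JsonResult<T> = { ok: true; value: T } | { ok: false; error: string };

// Parses model output expected to be a JSON object; never throws.
function parseJsonObject<T = Record<string, unknown>>(
  content: string | null | undefined,
): JsonResult<T> {
  if (!content) return { ok: false, error: "empty content" };
  try {
    const value: unknown = JSON.parse(content);
    // json_object mode should yield an object; reject arrays and primitives.
    if (value === null || typeof value !== "object" || Array.isArray(value)) {
      return { ok: false, error: "top-level JSON value is not an object" };
    }
    return { ok: true, value: value as T };
  } catch (err) {
    return { ok: false, error: err instanceof Error ? err.message : String(err) };
  }
}
```

The type parameter is only a cast. Validate the fields (by hand or with a schema library) before trusting them.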
```typescript
// Function calling (tool use)
// `executeFunction` is app-specific and not defined here: it should dispatch on the tool name.
declare function executeFunction(name: string, args: unknown): Promise<unknown>;

async function chatWithTools(prompt: string) {
  const response = await cerebras.chat.completions.create({
    model: "llama3.3-70b",
    messages: [{ role: "user", content: prompt }],
    tools: [
      {
        type: "function",
        function: {
          name: "get_stock_price",
          description: "Get the current stock price for a ticker symbol",
          parameters: {
            type: "object",
            properties: {
              ticker: { type: "string", description: "Stock ticker (e.g., AAPL)" },
            },
            required: ["ticker"],
          },
        },
      },
    ],
    tool_choice: "auto",
  });

  const msg = response.choices[0].message;
  if (msg.tool_calls?.length) {
    // Execute each requested tool and collect the results as `tool` messages
    const toolResults = await Promise.all(
      msg.tool_calls.map(async (call) => {
        // Newer SDK typings also include non-function tool calls; only functions are handled here
        if (call.type !== "function") throw new Error("Unsupported tool call type");
        const args = JSON.parse(call.function.arguments);
        const result = await executeFunction(call.function.name, args);
        return {
          role: "tool" as const,
          tool_call_id: call.id,
          content: JSON.stringify(result),
        };
      })
    );

    // Get the final response: original prompt, the assistant's tool calls, then the results
    const final = await cerebras.chat.completions.create({
      model: "llama3.3-70b",
      messages: [
        { role: "user", content: prompt },
        msg,
        ...toolResults,
      ],
    });
    return final.choices[0].message.content;
  }

  return msg.content;
}
```
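The function-calling example calls `executeFunction` but never implements it. One way to supply it is sketched below. It assumes each tool is a plain async function keyed by the `name` used in the `tools` array. Unknown tools and failing handlers return an error object, so the model receives the failure as the tool result instead of the whole request crashing. `createToolExecutor` and the `get_stock_price` stub are hypothetical.

```typescript
type ToolHandler = (args: Record<string, unknown>) => Promise<unknown>;

// Builds an executeFunction(name, args) that dispatches to registered handlers.
function createToolExecutor(handlers: Record<string, ToolHandler>) {
  return async function executeFunction(
    name: string,
    args: Record<string, unknown>,
  ): Promise<unknown> {
    // Own-property check so names like "toString" do not resolve to Object.prototype
    const handler = Object.prototype.hasOwnProperty.call(handlers, name)
      ? handlers[name]
      : undefined;
    if (!handler) return { error: `Unknown tool: ${name}` };
    try {
      return await handler(args);
    } catch (err) {
      // Report handler failures back to the model as data rather than throwing.
      return { error: err instanceof Error ? err.message : String(err) };
    }
  };
}

// Hypothetical wiring: replace the stub with a real market-data lookup.
const executeFunction = createToolExecutor({
  get_stock_price: async ({ ticker }) => {
    if (typeof ticker !== "string") throw new Error("ticker must be a string");
    throw new Error(`get_stock_price not implemented (ticker: ${ticker})`);
  },
});
```

Returning errors as tool output lets the model apologize or retry with different arguments. Throwing would abort the whole `Promise.all`.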
```python
# src/cerebras_client.py: Cerebras with Python
import os

from openai import OpenAI

client = OpenAI(
    api_key=os.environ["CEREBRAS_API_KEY"],
    base_url="https://api.cerebras.ai/v1",
)

# Chat completion
response = client.chat.completions.create(
    model="llama3.3-70b",
    messages=[{"role": "user", "content": "Write a Python quicksort implementation"}],
    temperature=0.3,
    max_tokens=500,
)
print(response.choices[0].message.content)

# Streaming
stream = client.chat.completions.create(
    model="llama3.3-70b",
    messages=[{"role": "user", "content": "Explain transformers in 5 sentences"}],
    stream=True,
)
for chunk in stream:
    if chunk.choices:  # some chunks may carry no choices
        print(chunk.choices[0].delta.content or "", end="", flush=True)
print()
```
## Cerebras Models
- **llama3.3-70b** — Llama 3.3 70B, best quality, ~2000 tok/s output
- **llama3.1-8b** — Llama 3.1 8B, fastest option, ~2500+ tok/s output
- **llama3.1-70b** — Llama 3.1 70B, large context (128K tokens)
## Speed Comparison (approximate)
- Cerebras: 2000+ tok/s (70B model)
- Groq: 300-400 tok/s (70B model)
- Cloud GPU (A100): 50-80 tok/s (70B model)
- Local (M3 Max): 20-40 tok/s (70B quantized)
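To make these throughput figures concrete: wall-clock generation time is roughly output tokens divided by tokens per second. This ignores time to first token and network overhead. The quick sketch below uses the approximate numbers from the list above (low end of each range) for a hypothetical 1,000-token answer.

```typescript
// Approximate output throughput (tokens/second) from the comparison above, 70B-class models.
// For ranges, the low end is used.
const approxTokPerSec: Record<string, number> = {
  "Cerebras": 2000,
  "Groq": 300,
  "Cloud GPU (A100)": 50,
  "Local (M3 Max, quantized)": 20,
};

// Seconds to stream `tokens` output tokens at a steady rate (excludes time to first token).
function generationSeconds(tokens: number, tokPerSec: number): number {
  return tokens / tokPerSec;
}

for (const [name, rate] of Object.entries(approxTokPerSec)) {
  console.log(`${name}: ~${generationSeconds(1000, rate).toFixed(1)} s for 1,000 tokens`);
}
```

This prints about 0.5 s, 3.3 s, 20.0 s and 50.0 s respectively. The gap matters most for long outputs such as code generation.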
```bash
# Use any OpenAI-compatible SDK
npm install openai
pip install openai
# Set base_url to https://api.cerebras.ai/v1
```
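Because the API is OpenAI-compatible, no SDK is strictly required. A chat completion is an HTTPS `POST` to `{base_url}/chat/completions` with a Bearer token and a JSON body. The minimal sketch below builds the request separately from sending it, so the request can be inspected or tested. `buildChatRequest` is a hypothetical helper, and only standard OpenAI request fields are assumed.

```typescript
interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

interface ChatBody {
  model: string;
  messages: ChatMessage[];
  max_tokens?: number;
  stream?: boolean;
}

// Builds (but does not send) an OpenAI-compatible chat completion request.
function buildChatRequest(
  apiKey: string,
  body: ChatBody,
  baseUrl = "https://api.cerebras.ai/v1",
): { url: string; init: { method: "POST"; headers: Record<string, string>; body: string } } {
  return {
    url: `${baseUrl.replace(/\/+$/, "")}/chat/completions`, // tolerate a trailing slash
    init: {
      method: "POST",
      headers: {
        Authorization: `Bearer ${apiKey}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify(body),
    },
  };
}
```

Usage: `const { url, init } = buildChatRequest(key, { model: "llama3.3-70b", messages }); const data = await (await fetch(url, init)).json();`, then read `data.choices[0].message.content`.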
User request:
I have a RAG chatbot that answers questions from our docs. Set up Cerebras to evaluate answer quality.
The agent creates an evaluation suite with appropriate metrics (faithfulness, relevance, answer correctness), configures test datasets from real user questions, runs baseline evaluations, and sets up CI integration so evaluations run on every prompt or retrieval change.
User request:
We're testing GPT-4o vs Claude on our customer support prompts. Set up a comparison with Cerebras.
The agent creates a structured experiment with the existing prompt set, configures both model providers, defines scoring criteria specific to customer support (accuracy, tone, completeness), runs the comparison, and generates a summary report with statistical significance indicators.
Use `response_format: { type: "json_object" }` for reliable parsing.