skills/gemini-live-api-dev/SKILL.md
Use this skill when building real-time, bidirectional streaming applications with the Gemini Live API. Covers WebSocket-based audio/video/text streaming, voice activity detection (VAD), native audio features, function calling, session management, ephemeral tokens for client-side auth, and all Live API configuration options. SDKs covered - google-genai (Python), @google/genai (JavaScript/TypeScript).
npx skillsauth add google-gemini/gemini-skill gemini-live-api-devInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
The Live API enables low-latency, real-time voice and video interactions with Gemini over WebSockets. It processes continuous streams of audio, video, or text to deliver immediate, human-like spoken responses.
Key capabilities:
thinkingLevel)[!NOTE] The Live API currently only supports WebSockets. For WebRTC support or simplified integration, use a partner integration.
gemini-3.1-flash-live-preview — Optimized for low-latency, real-time dialogue. Native audio output, thinking (via thinkingLevel). 128k context window. This is the recommended model for all Live API use cases.[!WARNING] The following Live API models are deprecated and will be shut down. Migrate to
gemini-3.1-flash-live-preview.
gemini-2.5-flash-native-audio-preview-12-2025— Migrate togemini-3.1-flash-live-preview.gemini-live-2.5-flash-preview— Released June 17, 2025. Shutdown: December 9, 2025.gemini-2.0-flash-live-001— Released April 9, 2025. Shutdown: December 9, 2025.
google-genai — pip install google-genai@google/genai — npm install @google/genai[!WARNING] Legacy SDKs
google-generativeai(Python) and@google/generative-ai(JS) are deprecated. Use the new SDKs above.
To streamline real-time audio/video app development, use a third-party integration supporting the Gemini Live API over WebRTC or WebSockets:
audio/pcm;rate=16000[!IMPORTANT] Use
send_realtime_input/sendRealtimeInputfor all real-time user input (audio, video, and text).send_client_content/sendClientContentis only supported for seeding initial context history (requires settinginitial_history_in_client_contentinhistory_config). Do not use it to send new user messages during the conversation.
[!WARNING] Do not use
mediainsendRealtimeInput. Use the specific keys:audiofor audio data,videofor images/video frames, andtextfor text input.
from google import genai
client = genai.Client(api_key="YOUR_API_KEY")
import { GoogleGenAI } from '@google/genai';
const ai = new GoogleGenAI({ apiKey: 'YOUR_API_KEY' });
from google.genai import types
config = types.LiveConnectConfig(
response_modalities=[types.Modality.AUDIO],
system_instruction=types.Content(
parts=[types.Part(text="You are a helpful assistant.")]
)
)
async with client.aio.live.connect(model="gemini-3.1-flash-live-preview", config=config) as session:
pass # Session is active
const session = await ai.live.connect({
model: 'gemini-3.1-flash-live-preview',
config: {
responseModalities: ['audio'],
systemInstruction: { parts: [{ text: 'You are a helpful assistant.' }] }
},
callbacks: {
onopen: () => console.log('Connected'),
onmessage: (response) => console.log('Message:', response),
onerror: (error) => console.error('Error:', error),
onclose: () => console.log('Closed')
}
});
await session.send_realtime_input(text="Hello, how are you?")
session.sendRealtimeInput({ text: 'Hello, how are you?' });
await session.send_realtime_input(
audio=types.Blob(data=chunk, mime_type="audio/pcm;rate=16000")
)
session.sendRealtimeInput({
audio: { data: chunk.toString('base64'), mimeType: 'audio/pcm;rate=16000' }
});
# frame: raw JPEG-encoded bytes
await session.send_realtime_input(
video=types.Blob(data=frame, mime_type="image/jpeg")
)
session.sendRealtimeInput({
video: { data: frame.toString('base64'), mimeType: 'image/jpeg' }
});
[!IMPORTANT] A single server event can contain multiple content parts simultaneously (e.g., audio chunks and transcript). Always process all parts in each event to avoid missing content.
async for response in session.receive():
content = response.server_content
if content:
# Audio — process ALL parts in each event
if content.model_turn:
for part in content.model_turn.parts:
if part.inline_data:
audio_data = part.inline_data.data
# Transcription
if content.input_transcription:
print(f"User: {content.input_transcription.text}")
if content.output_transcription:
print(f"Gemini: {content.output_transcription.text}")
# Interruption
if content.interrupted is True:
pass # Stop playback, clear audio queue
// Inside the onmessage callback
const content = response.serverContent;
if (content?.modelTurn?.parts) {
for (const part of content.modelTurn.parts) {
if (part.inlineData) {
const audioData = part.inlineData.data; // Base64 encoded
}
}
}
if (content?.inputTranscription) console.log('User:', content.inputTranscription.text);
if (content?.outputTranscription) console.log('Gemini:', content.outputTranscription.text);
if (content?.interrupted) { /* Stop playback, clear audio queue */ }
TEXT or AUDIO per session, not bothWhen migrating from gemini-2.5-flash-native-audio-preview-12-2025 to gemini-3.1-flash-live-preview:
gemini-2.5-flash-native-audio-preview-12-2025 to gemini-3.1-flash-live-preview.thinkingLevel (minimal, low, medium, high) instead of thinkingBudget. Default is minimal for lowest latency.send_client_content is only for seeding initial context history (set initial_history_in_client_content in history_config). Use send_realtime_input for text during conversation.TURN_INCLUDES_AUDIO_ACTIVITY_AND_ALL_VIDEO instead of TURN_INCLUDES_ONLY_ACTIVITY. If sending constant video frames, consider sending only during audio activity to reduce costs.send_realtime_input for all real-time user input (audio, video, text). Reserve send_client_content only for seeding initial context historyaudioStreamEnd when the mic is paused to flush cached audioIf the search_docs tool (from the Google MCP server) is available, use it as your only documentation source:
search_docs with your query[!IMPORTANT] When MCP tools are present, never fetch URLs manually. MCP provides up-to-date, indexed documentation that is more accurate and token-efficient than URL fetching.
If no MCP documentation tools are available, fetch from the official docs index:
llms.txt URL: https://ai.google.dev/gemini-api/docs/llms.txt
This index contains links to all documentation pages in .md.txt format. Use web fetch tools to:
llms.txt to discover available documentation pageshttps://ai.google.dev/gemini-api/docs/live-session.md.txt)[!IMPORTANT] Those are not all the documentation pages. Use the
llms.txtindex to discover available documentation pages
The Live API supports 70 languages including: English, Spanish, French, German, Italian, Portuguese, Chinese, Japanese, Korean, Hindi, Arabic, Russian, and many more. Native audio models automatically detect and switch languages.
development
Use this skill when writing code that calls the Gemini API for text generation, multi-turn chat, multimodal understanding, image generation, streaming responses, background research tasks, function calling, structured output, or migrating from the old generateContent API. This skill covers the Interactions API, the recommended way to use Gemini models and agents in Python and TypeScript.
development
Use this skill when building applications with Gemini API hosted models, including Gemini and Gemma 4, working with multimodal content (text, images, audio, video), implementing function calling, using structured outputs, or needing current model specifications. Covers SDK usage (google-genai for Python, @google/genai for JavaScript/TypeScript, com.google.genai:google-genai for Java, google.golang.org/genai for Go), model selection, and API capabilities.
tools
Guides the usage of Gemini API on Google Cloud Vertex AI with the Gen AI SDK. Use when the user asks about using Gemini in an enterprise environment or explicitly mentions Vertex AI. Covers SDK usage (Python, JS/TS, Go, Java, C#), capabilities like Live API, tools, multimedia generation, caching, and batch prediction.
tools
Use when work should span one or more detached tasks but still behave like one job with a single owner context. TaskFlow is the durable flow substrate under authoring layers like Lobster, ACPX, plugins, or plain code. Keep conditional logic in the caller; use TaskFlow for flow identity, child-task linkage, waiting state, revision-checked mutations, and user-facing emergence.