.claude/skills/fal-ai-media/SKILL.md
Unified media generation via fal.ai MCP — image, video, and audio. Covers text-to-image (Nano Banana), text/image-to-video (Seedance, Kling, Veo 3), text-to-speech (CSM-1B), and video-to-audio (ThinkSound). Use when the user wants to generate images, videos, or audio with AI.
Install this skill globally with one command: `npx skillsauth add yusufcmg/Agent_Memory_Systems fal-ai-media`. Works with Claude Code, Cursor, and Windsurf.
Generate images, videos, and audio using fal.ai models via MCP.
The fal.ai MCP server must be configured. Add this entry to ~/.claude.json:
```json
"fal-ai": {
  "command": "npx",
  "args": ["-y", "fal-ai-mcp-server"],
  "env": { "FAL_KEY": "YOUR_FAL_KEY_HERE" }
}
```
Get an API key at fal.ai.
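As a minimal sketch, the `fal-ai` entry could also be merged into an existing config dictionary programmatically. This assumes the entry belongs under a top-level `mcpServers` object, which the text here does not state; `add_fal_server` is a hypothetical helper, not part of fal.ai or Claude Code.

```python
import copy
import json


def add_fal_server(config: dict, fal_key: str) -> dict:
    """Return a copy of `config` with the fal-ai MCP entry added.

    Assumption: MCP servers live under a top-level "mcpServers" key.
    """
    updated = copy.deepcopy(config)
    servers = updated.setdefault("mcpServers", {})
    servers["fal-ai"] = {
        "command": "npx",
        "args": ["-y", "fal-ai-mcp-server"],
        "env": {"FAL_KEY": fal_key},
    }
    return updated


if __name__ == "__main__":
    existing = {"mcpServers": {"other": {"command": "node"}}}
    print(json.dumps(add_fal_server(existing, "YOUR_FAL_KEY_HERE"), indent=2))
```

The helper copies the input rather than mutating it, so a failed write cannot leave a half-edited config in memory.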
The fal.ai MCP provides these tools:
- `search`: Find available models by keyword
- `find`: Get model details and parameters
- `generate`: Run a model with parameters
- `result`: Check async generation status
- `status`: Check job status
- `cancel`: Cancel a running job
- `estimate_cost`: Estimate generation cost
- `models`: List popular models
- `upload`: Upload files for use as inputs

Best for: quick iterations, drafts, text-to-image, image editing.
```
generate(
  app_id: "fal-ai/nano-banana-2",
  input_data: {
    "prompt": "a futuristic cityscape at sunset, cyberpunk style",
    "image_size": "landscape_16_9",
    "num_images": 1,
    "seed": 42
  }
)
```
Best for: production images, realism, typography, detailed prompts.
```
generate(
  app_id: "fal-ai/nano-banana-pro",
  input_data: {
    "prompt": "professional product photo of wireless headphones on marble surface, studio lighting",
    "image_size": "square",
    "num_images": 1,
    "guidance_scale": 7.5
  }
)
```
| Param | Type | Options | Notes |
|-------|------|---------|-------|
| prompt | string | required | Describe what you want |
| image_size | string | square, portrait_4_3, landscape_16_9, portrait_16_9, landscape_4_3 | Aspect ratio |
| num_images | number | 1-4 | How many to generate |
| seed | number | any integer | Reproducibility |
| guidance_scale | number | 1-20 | How closely to follow the prompt (higher = more literal) |
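The table above can be turned into a pre-flight check before calling `generate`. The sketch below is illustrative only: `validate_image_input` is a hypothetical helper, and the ranges are copied from the table rather than from any fal.ai schema.

```python
IMAGE_SIZES = {
    "square", "portrait_4_3", "landscape_16_9", "portrait_16_9", "landscape_4_3",
}


def validate_image_input(input_data: dict) -> list[str]:
    """Return a list of problems; an empty list means the payload fits the table."""
    problems = []
    prompt = input_data.get("prompt")
    if not isinstance(prompt, str) or not prompt.strip():
        problems.append("prompt is required and must be a non-empty string")
    size = input_data.get("image_size")
    if size is not None and size not in IMAGE_SIZES:
        problems.append(f"image_size {size!r} is not a listed option")
    num = input_data.get("num_images")
    if num is not None and not (isinstance(num, int) and 1 <= num <= 4):
        problems.append("num_images must be an integer from 1 to 4")
    seed = input_data.get("seed")
    if seed is not None and not isinstance(seed, int):
        problems.append("seed must be an integer")
    scale = input_data.get("guidance_scale")
    if scale is not None and not (isinstance(scale, (int, float)) and 1 <= scale <= 20):
        problems.append("guidance_scale must be between 1 and 20")
    return problems


if __name__ == "__main__":
    print(validate_image_input({"prompt": "a cat", "num_images": 9}))
```

Collecting every problem instead of raising on the first one lets a caller fix a payload in a single pass.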
Use Nano Banana 2 with an input image for inpainting, outpainting, or style transfer:
```
# First upload the source image
upload(file_path: "/path/to/image.png")

# Then generate with image input
generate(
  app_id: "fal-ai/nano-banana-2",
  input_data: {
    "prompt": "same scene but in watercolor style",
    "image_url": "<uploaded_url>",
    "image_size": "landscape_16_9"
  }
)
```
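The two-step flow above (upload, then generate with the returned URL) can be sketched as a single function. Here `upload_fn` and `generate_fn` are hypothetical stand-ins for the MCP tools, and the sketch assumes the upload step returns the hosted URL as a plain string, which the tool's actual response may not match.

```python
from typing import Callable


def edit_image(
    upload_fn: Callable[[str], str],
    generate_fn: Callable[[str, dict], dict],
    file_path: str,
    prompt: str,
    image_size: str = "landscape_16_9",
) -> dict:
    """Upload a local image, then run an edit using the returned URL."""
    # Assumption: the upload tool returns the hosted URL as a plain string.
    image_url = upload_fn(file_path)
    input_data = {
        "prompt": prompt,
        "image_url": image_url,
        "image_size": image_size,
    }
    return generate_fn("fal-ai/nano-banana-2", input_data)


if __name__ == "__main__":
    # Fake tools so the sketch runs without network access.
    def fake_upload(path: str) -> str:
        return "https://example.invalid/" + path.rsplit("/", 1)[-1]

    def fake_generate(app_id: str, data: dict) -> dict:
        return {"app_id": app_id, "input": data}

    print(edit_image(fake_upload, fake_generate,
                     "/path/to/image.png", "same scene but in watercolor style"))
```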
Best for: text-to-video, image-to-video with high motion quality.
```
generate(
  app_id: "fal-ai/seedance-1-0-pro",
  input_data: {
    "prompt": "a drone flyover of a mountain lake at golden hour, cinematic",
    "duration": "5s",
    "aspect_ratio": "16:9",
    "seed": 42
  }
)
```
Best for: text/image-to-video with native audio generation.
```
generate(
  app_id: "fal-ai/kling-video/v3/pro",
  input_data: {
    "prompt": "ocean waves crashing on a rocky coast, dramatic clouds",
    "duration": "5s",
    "aspect_ratio": "16:9"
  }
)
```
Best for: video with generated sound, high visual quality.
```
generate(
  app_id: "fal-ai/veo-3",
  input_data: {
    "prompt": "a bustling Tokyo street market at night, neon signs, crowd noise",
    "aspect_ratio": "16:9"
  }
)
```
Start from an existing image:
```
generate(
  app_id: "fal-ai/seedance-1-0-pro",
  input_data: {
    "prompt": "camera slowly zooms out, gentle wind moves the trees",
    "image_url": "<uploaded_image_url>",
    "duration": "5s"
  }
)
```
| Param | Type | Options | Notes |
|-------|------|---------|-------|
| prompt | string | required | Describe the video |
| duration | string | "5s", "10s" | Video length |
| aspect_ratio | string | "16:9", "9:16", "1:1" | Frame ratio |
| seed | number | any integer | Reproducibility |
| image_url | string | URL | Source image for image-to-video |
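One way to keep video payloads consistent with the table is a small builder that rejects unlisted values and only adds optional keys when they are set. `build_video_input` is a hypothetical helper, and whether every video model accepts every key is not confirmed here (the Veo 3 example above, for instance, passes no `duration`).

```python
from typing import Optional

DURATIONS = {"5s", "10s"}
ASPECT_RATIOS = {"16:9", "9:16", "1:1"}


def build_video_input(
    prompt: str,
    duration: str = "5s",
    aspect_ratio: str = "16:9",
    seed: Optional[int] = None,
    image_url: Optional[str] = None,
) -> dict:
    """Build input_data for a video model, rejecting values outside the table."""
    if not prompt.strip():
        raise ValueError("prompt is required")
    if duration not in DURATIONS:
        raise ValueError(f"duration must be one of {sorted(DURATIONS)}")
    if aspect_ratio not in ASPECT_RATIOS:
        raise ValueError(f"aspect_ratio must be one of {sorted(ASPECT_RATIOS)}")
    input_data = {
        "prompt": prompt,
        "duration": duration,
        "aspect_ratio": aspect_ratio,
    }
    # Optional keys are only included when set.
    if seed is not None:
        input_data["seed"] = seed
    if image_url is not None:
        input_data["image_url"] = image_url  # image-to-video instead of text-to-video
    return input_data


if __name__ == "__main__":
    print(build_video_input("ocean waves crashing on a rocky coast", seed=42))
```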
Text-to-speech with natural, conversational quality.
```
generate(
  app_id: "fal-ai/csm-1b",
  input_data: {
    "text": "Hello, welcome to the demo. Let me show you how this works.",
    "speaker_id": 0
  }
)
```
Generate matching audio from video content.
```
generate(
  app_id: "fal-ai/thinksound",
  input_data: {
    "video_url": "<video_url>",
    "prompt": "ambient forest sounds with birds chirping"
  }
)
```
For professional voice synthesis, use ElevenLabs directly:
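```python
import os

import requests

# Replace <voice_id> with the ID of the voice to use.
resp = requests.post(
    "https://api.elevenlabs.io/v1/text-to-speech/<voice_id>",
    headers={
        "xi-api-key": os.environ["ELEVENLABS_API_KEY"],
        "Content-Type": "application/json",
    },
    json={
        "text": "Your text here",
        "model_id": "eleven_turbo_v2_5",
        "voice_settings": {"stability": 0.5, "similarity_boost": 0.75},
    },
)
# Fail early on HTTP errors so an error body is not saved as audio.
resp.raise_for_status()

# The response body is the audio data.
with open("output.mp3", "wb") as f:
    f.write(resp.content)
```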
If VideoDB is configured, use its generative audio:
```python
import videodb

# Assumes VIDEO_DB_API_KEY is set in the environment.
conn = videodb.connect()
coll = conn.get_collection()
# Voice generation
audio = coll.generate_voice(text="Your narration here", voice="alloy")
# Music generation
music = coll.generate_music(prompt="upbeat electronic background music", duration=30)
# Sound effects
sfx = coll.generate_sound_effect(prompt="thunder crack followed by rain")
```
Before generating, check estimated cost:
```
estimate_cost(
  estimate_type: "unit_price",
  endpoints: {
    "fal-ai/nano-banana-pro": {
      "unit_quantity": 1
    }
  }
)
```
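When estimating several runs at once, the arguments can be assembled from a simple mapping. This is a sketch under assumptions: `build_cost_request` is a hypothetical helper, and it presumes the `endpoints` object may hold more than one endpoint, which the single-endpoint example above does not confirm.

```python
def build_cost_request(quantities: dict) -> dict:
    """Build estimate_cost arguments from {endpoint_id: unit_quantity}."""
    for endpoint, qty in quantities.items():
        if not isinstance(qty, int) or qty < 1:
            raise ValueError(
                f"unit_quantity for {endpoint} must be a positive integer"
            )
    return {
        "estimate_type": "unit_price",
        # Assumption: one request can list several endpoints.
        "endpoints": {
            endpoint: {"unit_quantity": qty}
            for endpoint, qty in quantities.items()
        },
    }


if __name__ == "__main__":
    print(build_cost_request({"fal-ai/nano-banana-pro": 1}))
```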
Find models for specific tasks:
```
search(query: "text to video")
find(endpoint_ids: ["fal-ai/seedance-1-0-pro"])
models()
```
- Use `seed` for reproducible results when iterating on prompts
- Check `estimate_cost` before running expensive video generations

Related skills:

- `videodb`: Video processing, editing, and streaming
- `video-editing`: AI-powered video editing workflows
- `content-engine`: Content creation for social platforms