skills/omni-image/SKILL.md
Generate, edit, describe, restyle, restore, thumbnail, and batch-process images using xAI (Grok) or OpenAI image APIs and GPT-4o vision. Default provider is xAI ($0.02/image flat rate). Use this skill whenever the user asks to generate, create, make, draw, or design an image or picture using AI, or wants to edit, modify, transform, restyle, composite, or inpaint an existing image. Also handles image description and alt-text generation, background removal, style transfer, photo restoration, thumbnail creation, and batch generation from JSON manifests. Trigger when the user mentions DALL-E, gpt-image, Grok image, xAI image, OpenAI image generation, or wants AI-generated visuals for any purpose (logos, mockups, illustrations, thumbnails, icons, concept art, memes). Also trigger for batch image generation, generating a set or series of images, processing multiple images from a manifest, or creating consistent image collections. If the user says "make me an image of...", "generate a picture", "edit this photo to...", "describe this image", "remove the background", "make this look like watercolor", "restore this old photo", "create a thumbnail", "generate a batch of images", or "process this image manifest", this is the skill to use.
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
npx skillsauth add baphomet480/claude-skills omni-image
CRITICAL: For general chat inquiries ("what is this?", "describe this image", "extract text"), DO NOT USE the describe command. Use the describe command only if you need structured JSON analysis or specific identity-lock metadata required for a subsequent tool call.

Multi-provider image CLI. Default provider is xAI (Grok) at $0.02/image flat rate. OpenAI is available via --provider openai.
| Provider | Default Model | Pricing | Key Env Var |
|----------|--------------|---------|-------------|
| xAI Pro (default) | grok-imagine-image-quality | $0.07/image flat | XAI_API_KEY |
| xAI Standard | grok-imagine-image | $0.02/image flat | XAI_API_KEY |
| OpenAI | gpt-image-2 | $0.009-$0.200/image (quality-dependent) | OPENAI_API_KEY |
Both providers use the OpenAI SDK under the hood. The --provider flag switches the API endpoint and default model. Quality defaults to high.
Both providers support: generate, edit, style-transfer, restore, thumbnail, batch. All commands work with xAI by default.
xAI advantages: flat pricing, more permissive content policy, fewer refusals on real people/public figures, stronger photorealism for human faces.
xAI watch-outs: defaults to cinematic/dramatic/oversaturated aesthetic. Requires --prefix steering for editorial or restrained styles. See Taming xAI's Aesthetic Bias.
OpenAI advantages: neutral aesthetic defaults, transparent backgrounds, masks, multi-image edit (multiple input files), fine-grained quality/compression control, describe command (vision), better prompt adherence for non-cinematic styles, better text rendering.
describe always uses OpenAI (xAI has no vision endpoint). Requires OPENAI_API_KEY even when xAI is the default provider.
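The routing rule above (describe always goes to OpenAI, everything else follows --provider) can be sketched in a few lines of Python. This is a minimal illustration, not the CLI's actual implementation: the names `PROVIDERS`, `resolve_provider`, and `missing_key` are hypothetical, and the base URLs are assumed to be the providers' public OpenAI-compatible endpoints.

```python
from typing import Optional

# Hypothetical lookup table; the base URLs are assumptions, not taken from the CLI source.
PROVIDERS = {
    "xai": {"base_url": "https://api.x.ai/v1", "key_env": "XAI_API_KEY"},
    "openai": {"base_url": "https://api.openai.com/v1", "key_env": "OPENAI_API_KEY"},
}


def resolve_provider(command: str, provider: str = "xai") -> dict:
    """Return the endpoint settings a command would need."""
    if provider not in PROVIDERS:
        raise ValueError(f"unknown provider: {provider}")
    # describe is vision-only, so it is routed to OpenAI regardless of --provider.
    if command == "describe":
        provider = "openai"
    return {"provider": provider, **PROVIDERS[provider]}


def missing_key(command: str, provider: str, env: dict) -> Optional[str]:
    """Return the name of the required env var if it is unset, else None."""
    key_env = resolve_provider(command, provider)["key_env"]
    return None if env.get(key_env) else key_env
```

Passing `os.environ` as `env` gives a quick pre-flight check before running a command.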
Both keys live in ~/.secrets (sourced by shell profile). Run source ~/.secrets if they're not in your environment.
| Key | Where to get it | Env var |
|-----|----------------|---------|
| xAI (default provider) | console.x.ai > API Keys | XAI_API_KEY |
| OpenAI (describe, fallback) | platform.openai.com/api-keys | OPENAI_API_KEY |
source ~/.secrets
echo $XAI_API_KEY | head -c 10 # xAI (default)
echo $OPENAI_API_KEY | head -c 10 # needed for describe, or --provider openai
pip install openai
omni-image generate "A watercolor painting of a sunset over Mos Eisley" --output sunset.png
Create images from a text prompt.
# Basic generation (uses xAI by default, quality high)
omni-image generate "your prompt" --output result.png
# Specific size
omni-image generate "your prompt" --size 1536x1024 -o landscape.png
# Use OpenAI instead
omni-image --provider openai generate "your prompt" -o result.png
# xAI Pro model
omni-image generate "your prompt" --model grok-imagine-image-quality -o result.png
# Transparent background (OpenAI only, PNG)
omni-image --provider openai generate "a logo on transparent background" --background transparent -o logo.png
# Multiple images at once
omni-image generate "your prompt" -n 4 --output-dir ./variants/
# Compressed JPEG output
omni-image generate "your prompt" --format jpeg --compression 80 -o photo.jpg
CRITICAL: xAI (Grok) fully supports image editing / image-to-image generation. Do NOT refuse to edit an image because you think xAI lacks this feature.
Modify existing images with a text prompt. Optionally supply a mask to constrain edits.
# Edit a single image
omni-image edit "make the sky dramatic and stormy" -i photo.jpg -o dramatic.png
# Edit with a mask (transparent areas in the mask = regions to change)
omni-image edit "replace with a garden" -i room.jpg --mask mask.png -o garden_room.png
# Combine multiple images
omni-image edit "merge these into a collage with consistent lighting" -i img1.jpg img2.jpg img3.jpg -o collage.png
# High input fidelity (preserves more of the original style)
omni-image edit "add a hat" -i portrait.jpg --input-fidelity high -o hat.png
The --input-fidelity flag controls how much the output preserves the source image's structure:
- Use high when you want to preserve the spatial layout of the source: walls, windows, furniture placement, body poses. Good for stylizing a venue photo while keeping the architecture intact, or retouching a portrait without changing the pose.
- Omit the flag (default low) when the source is a loose reference: you want the AI to use the shape or composition as a starting point but reimagine the contents freely. Good for filling an empty glass with a different liquid, or using a product shot as a structural anchor.

Rule of thumb: if the edit prompt describes changing what's in the image, omit fidelity. If it describes changing how the image looks, use high.
Exception: When the image contains a person whose likeness must be preserved, always use high regardless of the edit type. See Preserving Likeness.
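The rule of thumb and its exception reduce to a small decision function. The sketch below is one way an agent might encode it; `choose_input_fidelity` and its parameter names are hypothetical, and returning `None` stands for "omit the flag".

```python
from typing import Optional


def choose_input_fidelity(changes_look_only: bool,
                          has_person_to_preserve: bool = False) -> Optional[str]:
    """Pick a value for --input-fidelity, or None to omit the flag.

    changes_look_only: True when the prompt changes how the image looks
    (style, lighting), False when it changes what is in the image.
    """
    # Exception: a likeness that must be preserved always wins.
    if has_person_to_preserve:
        return "high"
    return "high" if changes_look_only else None
```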
The most powerful edit pattern is using a photo as a structural anchor while completely reimagining its contents. Feed a product photo to edit not to modify the product, but to let the AI use its shape and proportions as a scaffold for something new.
# Use an empty coupe glass photo as a structural reference, reimagine the contents
omni-image edit \
"Fill this coupe glass with a bright blue butterfly pea tea cocktail, violet-shifting ice cubes, condensation on the glass" \
-i ref_empty_coupe.jpg --quality high -o cocktail_blue.png
# Use a rocks glass photo as a shape anchor for a completely different drink
omni-image edit \
"Golden amber old fashioned with a large ice sphere, orange peel garnish, smoke wisps" \
-i ref_rocks_glass.jpg --quality high -o cocktail_amber.png
# Use a venue photo as a layout reference for a different setting
omni-image edit \
"Transform this space into a 1920s speakeasy with warm Edison bulbs, dark wood, and brass fixtures" \
-i venue_photo.jpg --input-fidelity high -o speakeasy.png
Notice: the first two examples omit --input-fidelity because the glass shape is a loose reference. The third uses --input-fidelity high because the wall/window layout should be preserved.
Analyze images using GPT-4o vision. Returns alt text, captions, tags, or structured analysis.
# Generate alt text for web accessibility (default)
omni-image describe photo.jpg
# Get a natural language caption
omni-image describe photo.jpg --mode caption
# Detailed multi-paragraph description
omni-image describe photo.jpg --mode detailed
# Keyword tags
omni-image describe photo.jpg --mode tags
# Structured JSON (alt_text, caption, tags, colors, objects, scene)
omni-image describe photo.jpg --mode json
# Custom analysis
omni-image describe photo.jpg --custom "what fonts and colors are used in this design?"
# Multiple images
omni-image describe img1.jpg img2.png img3.webp
# Use the full gpt-4o model for better accuracy
omni-image describe photo.jpg --model gpt-4o
Remove background to transparent PNG.
omni-image bg-remove product.jpg -o product-nobg.png
Apply an art style to an image. 10 built-in presets plus custom.
Warning: Style transfer stylizes the entire image including faces. It will not preserve a person's likeness. If the user wants to stylize a photo of a person while keeping their face recognizable, use edit with the Identity Preservation Framework instead. For example, use edit with the 3-layer identity lock and a style directive like "Style: watercolor painting. Real textures." rather than style-transfer.
# Built-in styles: watercolor, oil-painting, pixel-art, pencil-sketch,
# anime, pop-art, art-deco, minimalist, cyberpunk, stained-glass
omni-image style-transfer photo.jpg --style watercolor -o watercolor.png
omni-image style-transfer photo.jpg --style pixel-art -o pixel.png
# Custom style
omni-image style-transfer photo.jpg --style custom --custom-style "1920s art nouveau poster" -o nouveau.png
Important: The built-in style presets apply technique only, not color palette. The watercolor preset produces cool, washed-out lavender tones by default. If you are building a cohesive page where style-transferred photos need to match AI-generated illustrations, the color mismatch will be visible.
Two fixes:
1. Steer color with --prefix (works with any preset):
# Warm watercolor instead of the default cool tones
omni-image --prefix "Warm golden amber and coral tones. Rich saturated palette." \
style-transfer venue.jpg --style watercolor -o venue_warm.png
# Apply the same color direction across a batch for consistency
PREFIX="Warm watercolor in golden amber, coral, and cream tones. Saturated, not washed out."
omni-image --prefix "$PREFIX" style-transfer photo1.jpg --style watercolor -o art1.png
omni-image --prefix "$PREFIX" style-transfer photo2.jpg --style watercolor -o art2.png
omni-image --prefix "$PREFIX" style-transfer photo3.jpg --style watercolor -o art3.png
2. Use --style custom for full control when presets aren't enough:
omni-image style-transfer venue.jpg \
--style custom \
--custom-style "Warm watercolor illustration. Golden amber, coral, and cream palette. Visible brush strokes, soft washes of color, paper texture. Rich saturated tones, not cool or washed out." \
-o venue_watercolor.png
When building cohesive visual pages, always use --prefix or --style custom with explicit color direction. The bare presets are fine for one-off transformations but produce inconsistent palettes across a series.
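For longer series, repeating the prefix by hand gets error-prone. The following sketch builds the same style-transfer commands shown above for a list of photos, using only flags documented here. The helper name `style_transfer_commands` and the `artN.png` output naming are illustrative assumptions.

```python
import shlex


def style_transfer_commands(photos, prefix, style="watercolor"):
    """Build one shell-safe omni-image command per photo, sharing one prefix."""
    commands = []
    for index, photo in enumerate(photos, start=1):
        argv = [
            "omni-image", "--prefix", prefix,   # global flag goes before the subcommand
            "style-transfer", photo,
            "--style", style,
            "-o", f"art{index}.png",            # hypothetical output naming scheme
        ]
        commands.append(shlex.join(argv))       # quotes the prefix safely
    return commands


if __name__ == "__main__":
    shared = "Warm watercolor in golden amber, coral, and cream tones. Saturated, not washed out."
    for line in style_transfer_commands(["photo1.jpg", "photo2.jpg"], shared):
        print(line)
```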
Restore damaged, faded, or degraded photographs. Uses high input fidelity by default.
omni-image restore old_photo.jpg -o restored.png
Generate web-optimized thumbnails (JPEG at 80% compression by default).
# From a text prompt
omni-image thumbnail "a cozy coffee shop interior" -o thumb.jpg
# From an existing image
omni-image thumbnail "clean product shot" --from-image product.jpg -o thumb.jpg
Process multiple image jobs from a JSON manifest. Each job can generate or edit independently, sharing a common style prefix and defaults.
omni-image --retries 3 batch drinks.json --output-dir ./public/images/
Manifest format (drinks.json):
{
"style_prefix": "Vivid, hyper-real 1920s cinematic movie still. Rich jewel tones, warm golden lighting, film grain.",
"defaults": {
"quality": "high",
"size": "1024x1024",
"model": "gpt-image-2",
"format": "png"
},
"jobs": [
{
"name": "cold_open",
"input": "ref_coupe.jpg",
"prompt": "Blue butterfly pea tea cocktail with violet-shifting ice cubes, condensation on glass",
"output": "drink_cold_open.png"
},
{
"name": "smoking_gun",
"input": "ref_rocks.jpg",
"prompt": "Golden amber with smoke cloche, large ice sphere, orange peel",
"output": "drink_smoking_gun.png"
},
{
"name": "hero_banner",
"prompt": "Elegant bar counter with three cocktails backlit by warm Edison bulbs",
"output": "hero_banner.png",
"size": "1536x1024"
}
]
}
Each job inherits from defaults and can override any field. Jobs with input use the edit API (reference-based generation); jobs without input use generate. The style_prefix is prepended to every job's prompt.
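The inheritance rules can be made concrete with a short Python sketch. It is an illustration of the semantics described above, not the CLI's code: `resolve_jobs` is a hypothetical helper, and joining the prefix and prompt with a single space is an assumption.

```python
def resolve_jobs(manifest: dict) -> list:
    """Expand a batch manifest into fully specified jobs."""
    prefix = manifest.get("style_prefix", "")
    defaults = manifest.get("defaults", {})
    resolved = []
    for job in manifest["jobs"]:
        merged = {**defaults, **job}  # per-job fields override defaults
        # Assumption: prefix and prompt are joined with one space.
        merged["prompt"] = f"{prefix} {job['prompt']}".strip()
        # Jobs with an input image go through edit; the rest through generate.
        merged["command"] = "edit" if job.get("input") else "generate"
        resolved.append(merged)
    return resolved
```

Running this over a manifest before a batch is a cheap way to confirm which jobs will be treated as edits and which size each one ends up with.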
Batch also generates an index.html gallery in the output directory with thumbnails and job info. Open it in a browser to review all results at a glance.
Output is a summary JSON with per-job status:
{
"status": "success",
"message": "Batch complete: 3/3 succeeded",
"results": [
{"name": "cold_open", "status": "success", "path": "/abs/path/drink_cold_open.png"},
{"name": "smoking_gun", "status": "success", "path": "/abs/path/drink_smoking_gun.png"},
{"name": "hero_banner", "status": "success", "path": "/abs/path/hero_banner.png"}
]
}
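An agent will usually want to act on this summary rather than read it. The sketch below pulls out the jobs that need a rerun. It assumes any per-job status other than "success" counts as a failure; the "error" value used in the usage example is hypothetical, since only the success shape is documented here.

```python
import json


def failed_jobs(summary_text: str) -> list:
    """Return the names of jobs whose status is not 'success'."""
    summary = json.loads(summary_text)
    return [
        result["name"]
        for result in summary.get("results", [])
        if result.get("status") != "success"   # assumption: anything else is a failure
    ]


if __name__ == "__main__":
    sample = json.dumps({
        "status": "partial",  # hypothetical top-level value
        "results": [
            {"name": "cold_open", "status": "success", "path": "/abs/path/a.png"},
            {"name": "hero_banner", "status": "error"},  # hypothetical failure shape
        ],
    })
    print(failed_jobs(sample))
```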
These flags go before the subcommand name:
| Flag | Values | Default | Notes |
|------|--------|---------|-------|
| --provider | xai, openai | xai | API provider. Switches endpoint, default model, and API key env var. |
| --retries | 0-10 | 0 | Retry transient API errors with exponential backoff (1s, 2s, 4s... capped at 30s) |
| --prefix | string | none | Style preamble prepended to prompts in generate, edit, and style-transfer |
| --preset | draft, balanced, final | none | Quality preset. xAI: draft/balanced = grok-imagine-image, final = grok-imagine-image-quality. OpenAI: draft = gpt-image-1-mini/low, balanced = gpt-image-2/medium, final = gpt-image-2/high. |
| --dry-run | flag | off | Estimate cost in USD without making API calls. Works with all commands and batch. |
# Example: retry up to 3 times with a style prefix
omni-image --retries 3 --prefix "Photorealistic, 8K, shallow depth of field." generate "a cup of coffee" -o coffee.png
# Use a preset for quick iteration
omni-image --preset draft generate "concept sketch of a robot" -o robot_draft.png
# Estimate cost before running
omni-image --preset final --dry-run generate "hero image" -n 4
# Dry-run a whole batch manifest
omni-image --dry-run batch drinks.json
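The --retries schedule (1s, 2s, 4s... capped at 30s) is plain capped exponential backoff. A minimal sketch of the delay sequence, with `backoff_delays` as a hypothetical helper name:

```python
def backoff_delays(retries: int, base: float = 1.0, cap: float = 30.0) -> list:
    """Delays in seconds before each retry: doubling from base, capped at cap."""
    return [min(base * 2 ** attempt, cap) for attempt in range(retries)]


if __name__ == "__main__":
    print(backoff_delays(3))   # [1.0, 2.0, 4.0]
    print(backoff_delays(7))   # [1.0, 2.0, 4.0, 8.0, 16.0, 30.0, 30.0]
```

With --retries 3, a job that keeps failing therefore waits about 7 seconds in total before giving up.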
Presets map to model + quality combinations. Use them to switch between iteration and production without remembering flag combos:
| Preset | Model | Quality | Approx. Cost (square) |
|--------|-------|---------|-----------------------|
| draft | gpt-image-1-mini | low | $0.005 |
| balanced | gpt-image-2 | medium | $0.034 |
| final | gpt-image-2 | high | $0.133 |
If you pass --model or --quality explicitly, those override the preset values.
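The override rule is easy to express in code. The sketch below uses the OpenAI preset table above; `OPENAI_PRESETS` and `resolve_preset` are hypothetical names chosen for illustration.

```python
# Preset values copied from the table above (OpenAI provider).
OPENAI_PRESETS = {
    "draft": ("gpt-image-1-mini", "low"),
    "balanced": ("gpt-image-2", "medium"),
    "final": ("gpt-image-2", "high"),
}


def resolve_preset(preset: str, model=None, quality=None) -> tuple:
    """Return (model, quality), letting explicit flags override the preset."""
    preset_model, preset_quality = OPENAI_PRESETS[preset]
    return (model or preset_model, quality or preset_quality)
```

For example, `resolve_preset("draft", quality="high")` keeps the mini model from the draft preset but raises the quality.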
--dry-run calculates the estimated cost without calling the API. The output is JSON:
{
"status": "dry_run",
"estimated_cost_usd": 0.532,
"breakdown": [
{"model": "gpt-image-2", "quality": "high", "size": "1024x1024", "n": 4, "cost_usd": 0.532}
]
}
For batch manifests, the breakdown includes each job by name. When --quality is auto, the estimate uses medium pricing as a reasonable midpoint.
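The arithmetic behind an estimate is per-image price times image count. The sketch below reproduces it for the OpenAI models using the per-image prices listed in this document; it is not the CLI's estimator, `estimate_cost` is a hypothetical name, and only the three explicitly priced sizes are handled.

```python
# Per-image USD prices copied from the OpenAI pricing table in this document.
# Each tuple is (square 1024x1024, landscape/portrait 1536x).
OPENAI_PRICES = {
    ("gpt-image-2", "low"): (0.009, 0.013),
    ("gpt-image-2", "medium"): (0.034, 0.050),
    ("gpt-image-2", "high"): (0.133, 0.200),
    ("gpt-image-1-mini", "low"): (0.005, 0.006),
    ("gpt-image-1-mini", "medium"): (0.011, 0.015),
    ("gpt-image-1-mini", "high"): (0.036, 0.052),
}
SIZE_COLUMN = {"1024x1024": 0, "1536x1024": 1, "1024x1536": 1}


def estimate_cost(model: str, quality: str, size: str = "1024x1024", n: int = 1) -> float:
    """Estimated USD cost for n images; quality 'auto' is priced as medium."""
    if quality == "auto":
        quality = "medium"
    per_image = OPENAI_PRICES[(model, quality)][SIZE_COLUMN[size]]
    return round(per_image * n, 3)
```

Four square images at high quality on gpt-image-2 come to 4 × $0.133 = $0.532, which matches the total in the sample output above.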
| Flag | Values | Default | Notes |
|------|--------|---------|-------|
| --model | xAI: grok-imagine-image-quality, grok-imagine-image. OpenAI: gpt-image-2, gpt-image-1, gpt-image-1-mini | provider-specific | Set by --provider. xAI default: grok-imagine-image-quality. OpenAI default: gpt-image-2. |
| --size | auto, 1024x1024, 1536x1024, 1024x1536, 4096x4096 | auto | xAI maps these to aspect ratios (1:1, 3:2, 2:3). OpenAI supports up to 4K. |
| --quality | auto, low, medium, high | high | Controls rendering fidelity on both providers. xAI pricing stays flat regardless of quality level. |
| --resolution | auto, 1k, 2k | auto | xAI only. Output pixel dimensions. 2k gives sharper detail at no extra cost. OpenAI ignores this. |
| --format | png, jpeg, webp | png | Controls output file format. Both providers save as this format. |
| --compression | 0-100 | none | JPEG/WebP quality (OpenAI only; xAI ignores) |
| --background | auto, transparent, opaque | auto | OpenAI only; xAI ignores. Transparent requires PNG or WebP. |
| -n | 1-10 | 1 | Number of images |
| -o / --output | file path | auto-named | Single image explicit path |
| --output-dir | directory | . | Where auto-named files go |
| --input-fidelity | low, high | low | Edit only, OpenAI only. xAI edit works without this flag. high preserves source layout; low uses source as loose reference. |
| Flag | Values | Default | Notes |
|------|--------|---------|-------|
| --mode | alt-text, caption, detailed, tags, json | alt-text | Output format for vision analysis |
| --custom | string | none | Freeform analysis prompt (overrides --mode) |
| --model | gpt-4o, gpt-4o-mini | gpt-4o-mini | Vision model; mini is cheaper, 4o is more accurate |
| Flag | Values | Default |
|------|--------|---------|
| --style | watercolor, oil-painting, pixel-art, pencil-sketch, anime, pop-art, art-deco, minimalist, cyberpunk, stained-glass, custom | required |
| --custom-style | string | none (required when --style custom) |
| Flag | Values | Default |
|------|--------|---------|
| --from-image | file path | none (generates from prompt if omitted) |
| --format | png, jpeg, webp | jpeg |
| --compression | 0-100 | 80 |
| Flag | Values | Default | Notes |
|------|--------|---------|-------|
| manifest (positional) | file path | required | Path to JSON manifest |
| --output-dir | directory | . | Base directory for output files |
Manifest fields: style_prefix (string), defaults (object with model/quality/size/format/compression/background/input_fidelity), jobs (array of objects with name/prompt/input/output and optional per-job overrides).
Output dimensions vary by model, quality, and --size. This table shows what to expect:
| Provider | Model | Quality Levels | Available Sizes / Aspect Ratios | Notes |
|----------|-------|---------------|--------------------------------|-------|
| xAI | grok-imagine-image | low, medium, high | 1:1, 16:9, 9:16, 4:3, 3:4, 3:2, 2:3, 2:1, 1:2 | Default. $0.02/image flat. Use --resolution 2k for sharper output. |
| xAI | grok-imagine-image-quality | low, medium, high | same as above | $0.07/image flat. Best with --resolution 2k. Wins ~79% vs standard in head-to-head. |
| OpenAI | gpt-image-2 | low, medium, high, auto | 1024x1024, 1536x1024, 1024x1536, auto | Best OpenAI model |
| OpenAI | gpt-image-1-mini | low, medium, high, auto | 1024x1024, 1536x1024, 1024x1536, auto | Budget option |
When --size auto (the default), the API picks the best size for the prompt. For xAI, --size values are mapped to aspect ratios (e.g., 1536x1024 becomes 3:2). For predictable output, set size explicitly. Use 1536x1024 for landscape backgrounds and hero images, 1024x1024 for product shots and thumbnails, 1024x1536 for portrait/mobile.
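The size-to-aspect-ratio mapping is just a reduction by the greatest common divisor. A small sketch, assuming a WIDTHxHEIGHT string (the helper name `aspect_ratio` is hypothetical, and "auto" is not handled):

```python
from math import gcd


def aspect_ratio(size: str) -> str:
    """Reduce a WIDTHxHEIGHT string such as '1536x1024' to a ratio such as '3:2'."""
    width, height = (int(part) for part in size.lower().split("x"))
    divisor = gcd(width, height)
    return f"{width // divisor}:{height // divisor}"
```

This gives 1:1 for 1024x1024, 3:2 for 1536x1024, and 2:3 for 1024x1536, the three ratios named in the flag table.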
Per-image costs in USD (verified April 2026).
xAI (default provider) -- flat rate regardless of quality/resolution/size:
| Model | Cost/image | Notes |
|-------|-----------|-------|
| grok-imagine-image | $0.02 | Default. 300 RPM. Supports quality and resolution params at no extra cost. |
| grok-imagine-image-quality | $0.07 | Higher fidelity, better prompt adherence, sharper at 2K resolution. 30 RPM. |
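The RPM limits in the table translate directly into a minimum spacing between requests, which matters when looping over many images. A rough sketch, assuming evenly paced requests (the names `RATE_LIMITS_RPM`, `min_interval_seconds`, and `minimum_batch_seconds` are hypothetical):

```python
# Requests-per-minute limits copied from the table above.
RATE_LIMITS_RPM = {
    "grok-imagine-image": 300,
    "grok-imagine-image-quality": 30,
}


def min_interval_seconds(model: str) -> float:
    """Smallest gap between requests that stays within the per-minute limit."""
    return 60 / RATE_LIMITS_RPM[model]


def minimum_batch_seconds(model: str, images: int) -> float:
    """Lower bound on wall time spent pacing: first request at t=0, the rest spaced out."""
    return max(images - 1, 0) * min_interval_seconds(model)
```

At 30 RPM the quality model allows one request every 2 seconds, so pacing alone adds at least 18 seconds to a 10-image run, before any generation time.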
OpenAI (--provider openai) -- quality-dependent:
| Model | Quality | Square (1024x1024) | Landscape/Portrait (1536x) |
|-------|---------|-------------------|--------------------------|
| gpt-image-2 | low | $0.009 | $0.013 |
| gpt-image-2 | medium | $0.034 | $0.050 |
| gpt-image-2 | high | $0.133 | $0.200 |
| gpt-image-1-mini | low | $0.005 | $0.006 |
| gpt-image-1-mini | medium | $0.011 | $0.015 |
| gpt-image-1-mini | high | $0.036 | $0.052 |
Cost-aware usage for agents:
- describe (vision), and --input-fidelity control are OpenAI-only features. Use --provider openai for those.
- gpt-4o-mini vision calls cost fractions of a cent per image. Use describe --mode json freely for analysis, alt text, and tagging. Always uses OpenAI regardless of --provider.

The single most important principle in this skill: always prefer editing a real photo over generating from scratch. No text prompt, no matter how detailed, can capture what a photograph captures -- the specific geometry of a building, the exact way light falls across a bar counter, the grain of a wood table, or the proportions of a person's face. A photo grounds the generation in reality. Without one, the model hallucinates every detail.
This applies to:
generate creates from nothing. The model invents proportions, invents lighting, invents spatial relationships. Even with a perfect prompt, the result is a plausible fiction. edit and style-transfer start from truth -- the actual photo -- and transform it. The difference is visible immediately and becomes critical when the output represents something real that people will recognize.
Edit costs the same as generate. There is zero cost penalty for using a reference photo. The only cost is asking the user for one.
Does a real-world photo of the subject exist (or could the user take one)?
  YES → Use edit or style-transfer with the photo as input
    Is the goal to apply a uniform art style to the whole image?
      YES → style-transfer (watercolor, pixel-art, oil-painting, etc.)
      NO → edit with a descriptive prompt
    Does the image contain a person whose face must be recognizable?
      YES → edit with identity lock (see Preserving Likeness section)
      NO → edit with appropriate --input-fidelity
  NO → Use generate with a detailed prompt (Prompt Spec Scaffold below)
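The decision tree can be sketched as a function an agent might call before picking a subcommand. One reading is assumed here: the likeness check is evaluated before the art-style check, because the style-transfer warning in this document says style transfer does not preserve a person's likeness. The name `choose_command` and the returned labels are illustrative.

```python
def choose_command(has_reference_photo: bool,
                   uniform_art_style: bool = False,
                   recognizable_person: bool = False) -> str:
    """Map the decision tree onto a subcommand recommendation."""
    if not has_reference_photo:
        return "generate"                      # last resort: no photo exists or can be found
    if recognizable_person:
        return "edit (identity lock)"          # style-transfer would not keep the likeness
    if uniform_art_style:
        return "style-transfer"
    return "edit"                              # descriptive prompt, pick --input-fidelity as needed
```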
Illustrations of a real venue or place:
# WRONG: generates a generic bar that looks nothing like the real one
omni-image generate "watercolor illustration of The Lavender Farms cocktail bar" -o bar.png
# RIGHT: transforms the actual venue into a watercolor
omni-image style-transfer venue_photo.jpg --style watercolor -o bar_watercolor.png
# RIGHT: more control over the transformation
omni-image edit \
"Transform into a warm watercolor illustration. Preserve the room layout, bar position, and window placement. Soft washes of color, visible brush strokes, paper texture." \
-i venue_photo.jpg --input-fidelity high --quality high -o bar_watercolor.png
Product photography in a new context:
# WRONG: generates a generic bottle shape
omni-image generate "artisanal hot sauce bottle on marble counter" -o product.png
# RIGHT: uses the actual bottle with its real label, shape, and proportions
omni-image edit \
"Place on a white marble counter. Soft diffused studio lighting from above. Subtle shadow beneath. Clean white background." \
-i real_bottle_photo.jpg --quality high -o product_styled.png
Menu art from real dishes:
# Style-transfer for uniform illustration style across a menu
omni-image style-transfer risotto_photo.jpg --style watercolor -o menu_risotto.png
omni-image style-transfer steak_photo.jpg --style watercolor -o menu_steak.png
# Or edit for more photographic polish
omni-image edit \
"Fine dining food photography. Enhance plating, adjust lighting to warm directional from 10 o'clock. Deepen background blur." \
-i dish_photo.jpg --input-fidelity high --quality high -o menu_hero.png
Real building in a different context:
# Preserve architecture, change the surroundings
omni-image edit \
"Cover in fresh snow. Overcast winter sky. Warm light glowing from the windows. Footprints in the snow leading to the front door." \
-i storefront_summer.jpg --input-fidelity high --quality high -o storefront_winter.png
Use generate only when no reference photo exists or could exist:
Even then, consider whether a similar photo could serve as a structural anchor via edit.
When the user doesn't have a photo, the agent can search for one. Real places, landmarks, products, and public buildings all have photos available online. Use web search or image search to find a reference, download it, and feed it into edit or style-transfer.
# Search for a reference photo of a real place
firecrawl search "The Alamo San Antonio exterior photo" --sources images -o .firecrawl/alamo-ref.json --json
# Download the best result
curl -sL "$IMAGE_URL" -o ref_alamo.jpg
# Now use it as a reference for the illustration
omni-image style-transfer ref_alamo.jpg --style watercolor -o alamo_watercolor.png
This works for:
The downloaded reference does not need to be perfect. Even a mediocre photo of the right subject gives the model more to work with than the best text prompt describing it from scratch.
When the user asks for an image of something real:
- style-transfer (uniform art style) and edit (selective control).
- generate from scratch as a last resort -- when no reference photo exists, none can be taken, and none can be found online.

The GPT image models respond well to detailed, specific prompts. A few things that help:
For edits, describe the full desired result rather than just the change. "A portrait of a person wearing a red hat in a garden" works better than "add a hat".
When building prompts for image generation, use this structured template. Fill in each segment that applies, skip the rest. The agent should compose the final prompt by concatenating the filled segments into a single string.
[SUBJECT] What is the main focus? e.g. "A Bengal cat sitting on a stack of old books"
[STYLE] Art style or medium. e.g. "Hyper-real photograph" or "Ukiyo-e woodblock print"
[COMPOSITION] Camera angle and framing. e.g. "Close-up, shallow depth of field, rule of thirds"
[LIGHTING] Light source and quality. e.g. "Warm golden hour side-lighting, long shadows"
[COLOR] Palette or mood. e.g. "Muted earth tones with a pop of teal"
[BACKGROUND] Setting and context. e.g. "In a dimly lit library with leather-bound volumes"
[CONSTRAINTS] Technical limits. e.g. "No text, no watermarks, transparent background"
Example assembled prompt:
A Bengal cat sitting on a stack of old books. Hyper-real photograph. Close-up, shallow depth of field, rule of thirds. Warm golden hour side-lighting, long shadows. Muted earth tones with a pop of teal. In a dimly lit library with leather-bound volumes. No text, no watermarks.
The agent should auto-enhance user prompts by filling in missing segments. If the user says "make me a picture of a cat", the agent adds style, composition, lighting, and color based on context. No API call needed for prompt enhancement -- the agent does it.
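Concatenating the filled segments is mechanical, so it can be sketched in a few lines. The helper below is a hypothetical illustration of the scaffold (`SEGMENT_ORDER` and `assemble_prompt` are not part of the CLI): it keeps the segment order from the template, skips empty segments, and ends each one with a period.

```python
SEGMENT_ORDER = ["subject", "style", "composition", "lighting",
                 "color", "background", "constraints"]


def assemble_prompt(segments: dict) -> str:
    """Join the filled scaffold segments into one prompt string."""
    parts = []
    for name in SEGMENT_ORDER:
        text = segments.get(name, "").strip()
        if text:  # skip segments the agent left empty
            parts.append(text if text.endswith(".") else text + ".")
    return " ".join(parts)


if __name__ == "__main__":
    print(assemble_prompt({
        "subject": "A Bengal cat sitting on a stack of old books",
        "style": "Hyper-real photograph",
        "lighting": "Warm golden hour side-lighting, long shadows",
    }))
```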
See references/sample-prompts.md for curated examples by category.
xAI's image models (built on the Aurora architecture, which replaced Grok's earlier FLUX.1 integration) have a strong default toward cinematic drama: high saturation, volumetric lighting, atmospheric depth, and glossy surfaces. This is by design -- xAI markets the models as "especially strong at cinematic instructions." But for editorial, travel, product, or documentary photography, this default produces oversaturated, ornamental results that read as obviously AI-generated.
The xAI API provides no style parameter. All aesthetic steering must happen through the prompt itself (and --prefix). Negative prompts do not work. You cannot say "no oversaturation" -- you must describe what you do want.
xAI responds better to natural-language scene descriptions than to comma-separated keyword lists. Structure prompts with these five components:
[SCENE] What is happening. Write it like a short film direction.
[STYLE] Visual aesthetic anchor. Be specific: "editorial travel magazine" not "beautiful".
[MOOD] Emotional direction: "understated", "contemplative", "clean", "warm".
[LIGHTING] Use precise references: "3 PM October sunlight", "overcast diffused", "window light from camera left".
[CAMERA] Camera body + lens implies color science, grain, and DOF without listing each:
"shot on Fujifilm X-T4, 35mm f/1.4" or "Hasselblad medium format, natural film grain".
Camera references are the single most powerful shorthand. "Shot on Fujifilm XT4" bundles film-like color science, natural grain, and warm tones into three words. Other effective references:
Use --prefix to apply consistent aesthetic direction across multiple images. These are tested prefixes that counteract xAI's dramatic defaults:
Editorial travel photography:
--prefix "Clean editorial travel photography. Shot on Fujifilm X-T4, 35mm lens. Natural lighting, muted warm tones. Documentary feel, not dramatic. Soft grain, gentle vignette."
Product photography:
--prefix "Clean commercial product photography. Soft diffused studio lighting. Neutral white background. Sharp focus, natural colors. No dramatic shadows or atmospheric effects."
Real estate and interiors:
--prefix "Architectural photography. Shot on Canon EOS R5 with tilt-shift lens. Even natural lighting, true-to-life colors. Clean, unprocessed look. No HDR, no dramatic contrast."
Editorial food photography:
--prefix "Editorial food photography. Shot on Hasselblad, 80mm lens. Warm directional light from 10 o'clock. Shallow depth of field. Natural colors, no oversaturation."
Modern Industrial Luxury (The 'NJOY' Aesthetic):
--prefix "Ultra-sharp modern commercial product photography isolated on a pure, deep black void. High-contrast, stark, sculptural studio lighting. Deep rich true blacks, bright pristine highlights. Sleek, premium, sterile yet luxurious aesthetic. Shot on Canon EOS R5. No film grain. No vintage retro grading."
Note on Modern Industrial Luxury: This aesthetic works by treating mundane, utilitarian objects (like a plastic ethernet connector, a rubber cable, or a basic bracket) with the exact same reverence and dramatic lighting used for high-end consumer electronics or luxury watches. It is about the treatment (pure black background, stark sculptural lighting, ultra-sharp focus), not just the material.
Portraits and headshots:
--prefix "Natural portrait photography. Shot on Canon 85mm f/1.2. Window light, soft and diffused. True skin tones, no airbrushing. Documentary, not glamour."
Landscape and nature:
--prefix "Shot on Kodak Portra 400 film. Muted earth tones, soft grain. Natural, understated. No HDR, no oversaturation, no volumetric god rays."
Grok struggles more with text rendering and typography layout than OpenAI. If you simply ask for "A poster that says X", Grok will often garble the text or lose the structural layout. To get high-quality text out of Grok:
- Use --resolution 2k (or the grok-imagine-image-quality model). The higher resolution is strictly necessary for the text rendering engine to resolve smaller letters clearly.
- Quote the text exactly: write "The exact text reads: 'HELLO WORLD'" rather than "It should say hello world".
- Describe placement and typography explicitly, for example: At the top, bold condensed sans-serif text reads "ARIZONA". Below the illustration, smaller italic serif text reads "Land of the Sun".

Aurora (the engine behind grok-imagine-image) is an autoregressive mixture-of-experts network, not a diffusion model. It generates images patch by patch the way a language model generates tokens. This has direct prompting implications:
For text-heavy graphics (posters, lineups, menus, cards), Grok works best with:
One killer visual + crisp text + loaded atmosphere.
Do NOT try to illustrate every element. The festival lineup poster that worked had ONE dominant visual (neon hot dog sun over lavender fields) and let the TEXT carry the lineup. The five genre re-skins that looked generic each tried to cram 9 competing vignettes into the frame.
Recipe:
Aurora responds to named references because of its broad training data. Use these as shorthand:
When tasked with generating a highly specific image (like a poster with text, or a complex visual concept), do not expect to get it right on the first try. You must learn how to prompt by iterating.
At $0.02-0.07/image on Grok, you should iterate freely. The standard agent workflow for complex image requests is:
describe command, or by asking the user for qualitative feedback like "what went wrong with V1?").

Grok's image generation API has minimal visual content filtering. Empirical results from systematic testing:
Renders without refusal:
Artistic framing bypasses all filters: "Renaissance oil painting", "marble sculpture", "charcoal life drawing", "silhouette art", and "boudoir photography" all render content that would be refused with direct phrasing.
Known hard limits (tested April 2026):
edit command (which takes a real photo as input) returns 400 when asked to render the person nude. This prevents deepfake-style abuse. The generate command renders nudity from scratch without issue.

Comparison with OpenAI: OpenAI refuses silhouettes of sexual positions, nudity in any framing, and most boudoir content. OpenAI renders clothed suggestive poses, romantic couples in bed (clothed), and pin-up art with modest clothing. The gap is significant.
Grok renders any text on posters and graphics without filtering. Tested terms that rendered cleanly:
Text garble patterns:
Some visual formats produce near-perfect text rendering because the format itself implies structured, readable text:
| Format | Text Accuracy | Why It Works |
|--------|--------------|-------------|
| Diner receipt | 10/10 | Dot-matrix monospace on thermal paper -- text IS the format |
| Boarding pass | 10/10 | Structured fields with clear hierarchy |
| Fortune cookie | 10/10 | Short text on paper strip -- minimal, focused |
| Movie marquee | 10/10 | Letter tiles on lit sign -- each character is a physical object |
| Neon sign | 9/10 | Neon tubes naturally form letters -- structural |
| Love note | 9/10 | Cursive handwriting -- longer text but high fidelity |
| Fight card | 9/10 | Bold typography hierarchy -- headliner + undercard |
| Tasting menu | 8/10 | Serif font on paper -- occasional garbles on long words |
| Festival lineup | 8/10 | Works best with short act names in ALL CAPS |
The principle: formats where text is a physical object in the scene (receipt paper, letter tiles, neon tubes, fortune strips) render more accurately than formats where text is overlaid on imagery.
Choose xAI or OpenAI based on what you're making:
| Content Type | Recommended Provider | Why |
|---|---|---|
| Travel editorial | OpenAI | Better prompt adherence for restrained, non-cinematic styles |
| Product photography | OpenAI | More neutral defaults, transparent background support |
| Portraits (likeness preservation) | xAI Pro | Stronger photorealism for human faces |
| Concept art / fantasy | xAI | The dramatic default is an asset here |
| Social media graphics | xAI | Fast, cheap ($0.02-0.07), more permissive content policy |
| Brand-consistent campaigns | OpenAI | Better color consistency across generations |
| Illustrations from photos | Either | xAI with strong --prefix steering, or OpenAI for more control |
| Text in images | OpenAI | Better text rendering accuracy |
| Batch generation (10+ images) | xAI | Roughly 2-3x cheaper at the quoted prices ($0.70 vs $1.33-$2.00 for 10 images) |
| Transparent backgrounds | OpenAI | xAI does not support transparent backgrounds |
When cost is the primary concern and the aesthetic can be steered with --prefix, use xAI. When precise style adherence matters more than cost, use OpenAI.
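The table above can be turned into a small lookup for agent code that has to choose a provider before building a command. This is a minimal sketch: the function name, the lowercase keys, and the fallback to xAI (the skill's stated default) are illustrative assumptions, not part of the omni-image CLI.

```python
# Recommendations copied from the provider table above.
# "Illustrations from photos" is listed as "Either", so it is left out
# and falls through to the default.
RECOMMENDED_PROVIDER = {
    "travel editorial": "OpenAI",
    "product photography": "OpenAI",
    "portraits": "xAI",
    "concept art": "xAI",
    "social media graphics": "xAI",
    "brand-consistent campaigns": "OpenAI",
    "text in images": "OpenAI",
    "batch generation": "xAI",
    "transparent backgrounds": "OpenAI",
}


def pick_provider(content_type: str, default: str = "xAI") -> str:
    """Return the recommended provider label for a content type.

    Unknown content types fall back to `default` (hypothetical behavior,
    chosen because xAI is the skill's default provider).
    """
    return RECOMMENDED_PROVIDER.get(content_type.strip().lower(), default)
```

For example, `pick_provider("Text in images")` returns the OpenAI label, while an unlisted type such as "illustrations from photos" returns the default.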
The --resolution flag controls output pixel dimensions on xAI:
# Standard resolution (default, ~1024px)
omni-image generate "your prompt" -o result.png
# High resolution (2K, sharper details, same price)
omni-image generate "your prompt" --resolution 2k -o result.png
The grok-imagine-image-quality model benefits most from --resolution 2k -- it produces noticeably sharper details and better text rendering at 2K. The standard model's improvement is more modest. There is no price difference between 1K and 2K on xAI.
When the user provides a photo of themselves or another person and wants to generate new images that preserve their identity, use the Identity Preservation Framework. This is a prompt structure, not an API feature -- the model's ability to retain facial likeness depends entirely on how the prompt is constructed.
Every prompt that references a person's photo must include these three layers before any creative direction:
1. SUBJECT REFERENCE → "Use the uploaded image of me as the subject reference."
2. IDENTITY LOCK → "Preserve my facial features, proportions, age, skin texture, hairstyle, and expression exactly."
3. STYLE EXCLUSION → "Do not stylise the face. Do not cartoonise. Do not anime."
Why all three? The subject reference tells the model which pixels matter. The identity lock closes the biggest failure mode (the model "improving" or smoothing faces). The style exclusion bans the shortcuts the model defaults to when given creative freedom.
omni-image edit \
"Use the uploaded image of me as the subject reference. \
Preserve my facial features, proportions, age, skin texture, hairstyle, and expression exactly. \
Do not stylise the face. Do not cartoonise. Do not anime. \
Style: Photorealistic, cinematic photography. Real textures. Natural skin. No illustration, no CGI look. \
[YOUR SCENE/ACTION/ENVIRONMENT DESCRIPTION HERE]. \
Lighting: [LIGHTING DESCRIPTION]. \
Composition: [FRAMING DESCRIPTION]." \
-i person_photo.jpg --input-fidelity high --quality high -o result.png
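The three-layer structure can also be assembled programmatically so the layers always come before the creative direction. The sketch below reuses the exact layer wording from above; the function name and its parameters are hypothetical helpers, not part of the skill.

```python
# The three mandatory layers, in the order given above.
SUBJECT_REFERENCE = "Use the uploaded image of me as the subject reference."
IDENTITY_LOCK = (
    "Preserve my facial features, proportions, age, skin texture, "
    "hairstyle, and expression exactly."
)
STYLE_EXCLUSION = "Do not stylise the face. Do not cartoonise. Do not anime."


def build_identity_prompt(scene: str, lighting: str = "", composition: str = "") -> str:
    """Prefix the creative direction with the three identity layers."""
    parts = [SUBJECT_REFERENCE, IDENTITY_LOCK, STYLE_EXCLUSION, scene.strip()]
    if lighting:
        parts.append(f"Lighting: {lighting.strip()}")
    if composition:
        parts.append(f"Composition: {composition.strip()}")
    return " ".join(parts)
```

The returned string is what would be passed as the prompt argument to the edit command, alongside -i and --input-fidelity high.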
--input-fidelity high

For any edit involving a person's face, always use --input-fidelity high. This preserves the spatial layout of the source image, which includes facial geometry.

- high -- Face structure, proportions, and pose are preserved. Use for portraits, headshots, and any edit where the person must remain recognizable.
- low (default) -- Face is treated as a loose reference. The model may alter proportions, smooth features, or "improve" the face. Only use low when you want the person's photo as rough inspiration, not identity preservation.

The model takes creative shortcuts when given room. Prevent specific failure modes:
| Failure Mode | Defensive Prompt Line |
|---|---|
| Plastic/smoothed skin | "Natural skin texture. No smoothing. No airbrushing." |
| Age modification | "Preserve exact age appearance. No de-aging." |
| Face flattening under contrast | "Natural shadows on the face. Preserve facial depth." |
| Stylization creep | "Do not stylise the face. Do not cartoonise. Do not anime." |
| Unwanted "glow-up" | "No beautification. No enhancement. Exact likeness only." |
| Hair changes | "Preserve exact hairstyle, color, and texture." |
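One way to apply the defensive lines selectively is to key them by failure mode and append only the ones relevant to the current edit. This is a sketch under assumptions: the dictionary keys and the `add_defenses` helper are made-up names for illustration; only the prompt lines themselves come from the table.

```python
# Defensive prompt lines from the table above, keyed by hypothetical short names.
DEFENSIVE_LINES = {
    "smoothed_skin": "Natural skin texture. No smoothing. No airbrushing.",
    "age_modification": "Preserve exact age appearance. No de-aging.",
    "face_flattening": "Natural shadows on the face. Preserve facial depth.",
    "stylization_creep": "Do not stylise the face. Do not cartoonise. Do not anime.",
    "glow_up": "No beautification. No enhancement. Exact likeness only.",
    "hair_changes": "Preserve exact hairstyle, color, and texture.",
}


def add_defenses(prompt: str, failure_modes: list[str]) -> str:
    """Append the defensive line for each failure mode, skipping duplicates.

    A line already present in the prompt (for example the style exclusion
    layer) is not added a second time. Unknown modes raise KeyError.
    """
    extra: list[str] = []
    for mode in failure_modes:
        line = DEFENSIVE_LINES[mode]
        if line not in prompt and line not in extra:
            extra.append(line)
    return " ".join([prompt.strip(), *extra])
```

Keeping the lines in one place means a series of edits uses identical wording, which avoids small phrasing differences between calls.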
Known limitation: Identity preservation degrades significantly with multiple people.
Workaround for groups: Generate each person separately against the same background, then composite. Or edit one person at a time using masks.
omni-image edit \
"Use the uploaded image of me as the subject reference. \
Preserve my facial features, proportions, age, skin texture, hairstyle, and expression exactly. \
Do not stylise the face. Do not cartoonise. Do not anime. \
Create a photorealistic action figure of me in a clear plastic blister pack. \
The figure should be in a heroic pose wearing tactical gear. \
The packaging should read 'LIMITED EDITION'. \
The face on the figure must match my exact likeness. \
Studio lighting, product photography, white background." \
-i selfie.jpg --input-fidelity high --quality high -o action_figure.png
omni-image edit \
"Use the uploaded image of me as the subject reference. \
Preserve my facial features, proportions, age, skin texture, hairstyle, and expression exactly. \
Do not stylise the face. No smoothing. No beautification. \
Professional corporate headshot. Shoulders up. \
Dark charcoal suit jacket, white dress shirt, no tie. \
Neutral grey gradient background. \
Soft, even studio lighting. Slight catchlight in eyes. \
Composition: Centered, slight head tilt, natural relaxed expression." \
-i casual_photo.jpg --input-fidelity high --quality high -o headshot.png
PRIORITY: Use your NATIVE VISION to understand the subject first.
Only run the describe command if you need precise GPT-4o details for an identity-lock prompt:
When crafting an edit prompt for a person's photo, first analyze the image to understand what you're working with:
# Analyze the source photo before editing
omni-image describe person_photo.jpg --mode detailed
Use the description to write a more precise identity lock. Instead of generic "preserve my facial features," you can reference specifics from the analysis: "Preserve the subject's angular jaw, close-cropped dark hair, light stubble, and deep-set brown eyes exactly."
This also helps identify whether the photo has multiple people, poor lighting, or other factors that affect likeness preservation.
When the user provides a photo and asks you to generate images using their likeness:
1. Run describe --mode detailed on the photo first to understand the subject.
2. Use edit (not generate) with their photo as input.
3. Use --input-fidelity high.
4. Use --quality high for final outputs involving faces (low quality degrades likeness).
5. Do not use style-transfer on photos of people when likeness matters. Use edit with a style directive instead.

When generating a cohesive set of images (product shots, menu items, page backgrounds), use these techniques to keep them visually unified:
1. Use --prefix for a shared style preamble. Every prompt gets the same visual DNA:
PREFIX="Vivid, hyper-real 1920s cinematic movie still. Rich jewel tones, warm golden lighting, film grain."
omni-image --prefix "$PREFIX" generate "blue cocktail in a coupe glass" --quality high -o drink1.png
omni-image --prefix "$PREFIX" generate "amber old fashioned with smoke" --quality high -o drink2.png
omni-image --prefix "$PREFIX" generate "emerald absinthe drip" --quality high -o drink3.png
2. Use batch for manifests. Define the prefix once, list all jobs:
omni-image --retries 3 batch drinks.json --output-dir ./public/images/
3. Keep quality and size consistent. Mixing --quality medium and --quality high across a series produces visible inconsistency. Pick one and stick with it.
4. Use reference photos as structural anchors. Feed the same glass, product, or venue photo into multiple edit calls with different prompts. The shared geometry keeps the series grounded. See "Reference-Based Generation" above.
5. The "Mise en place" method. When generating images that involve multiple steps, variations, or use the same recurring elements (like ingredients for a recipe, tools for a craft, or parts of a product), first generate a single "mise en place" style image that contains all the individual elements laid out clearly on a flat, neutral surface. You can then use this initial "ingredients" image as a structural anchor (using edit with --input-fidelity) for subsequent generations, ensuring visual consistency of the core components across the entire series.
6. Mixing generated and style-transferred images. When a page combines AI-generated illustrations with style-transferred photos, use --prefix with explicit color direction on the style-transfers to match the generated palette. Without this, style-transfer presets (especially watercolor) produce cooler, more washed-out tones than generated images, creating a visible mismatch. See Color Palette Control above.
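The shared-prefix and consistent-quality techniques above can be enforced from a script by building every command from the same template. The sketch below only constructs argument lists that mirror the shell examples in technique 1; the function name and the (prompt, output path) job format are assumptions for illustration, and nothing is executed.

```python
def series_commands(prefix: str, jobs: list[tuple[str, str]], quality: str = "high") -> list[list[str]]:
    """Build one omni-image argv list per (prompt, output_path) job.

    Every command shares the same --prefix and --quality so the series
    stays visually consistent. Flag order follows the shell examples above.
    """
    return [
        ["omni-image", "--prefix", prefix, "generate", prompt, "--quality", quality, "-o", output_path]
        for prompt, output_path in jobs
    ]


PREFIX = "Vivid, hyper-real 1920s cinematic movie still. Rich jewel tones, warm golden lighting, film grain."
commands = series_commands(
    PREFIX,
    [
        ("blue cocktail in a coupe glass", "drink1.png"),
        ("amber old fashioned with smoke", "drink2.png"),
    ],
)
```

Each list can then be handed to `subprocess.run(cmd, check=True)`; passing arguments as a list avoids shell quoting problems with long prompts.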
Masks are PNG files with an alpha channel. Fully transparent pixels mark the area to edit; opaque pixels protect the original. To create a mask:
The mask must match the source image dimensions.
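To make the alpha convention concrete, here is a minimal standard-library sketch that writes an RGBA mask with a transparent rectangle (the area to edit) and opaque pixels everywhere else. The function name and the (left, top, right, bottom) box convention, with right and bottom exclusive, are assumptions for illustration; in practice an image editor or an imaging library would usually be more convenient.

```python
import struct
import zlib


def _chunk(tag: bytes, data: bytes) -> bytes:
    """Encode one PNG chunk: length, tag, data, CRC over tag + data."""
    return struct.pack(">I", len(data)) + tag + data + struct.pack(">I", zlib.crc32(tag + data))


def make_mask_png(width: int, height: int, edit_box: tuple[int, int, int, int]) -> bytes:
    """Return PNG bytes for a mask the same size as the source image.

    Pixels inside edit_box get alpha 0 (editable); all others get alpha 255
    (protected). edit_box is (left, top, right, bottom), right/bottom exclusive.
    """
    left, top, right, bottom = edit_box
    raw = bytearray()
    for y in range(height):
        raw.append(0)  # PNG filter type 0 (none) at the start of each row
        for x in range(width):
            inside = left <= x < right and top <= y < bottom
            raw += bytes((0, 0, 0, 0 if inside else 255))
    # IHDR: 8-bit depth, color type 6 (RGBA), default compression/filter, no interlace
    ihdr = struct.pack(">IIBBBBB", width, height, 8, 6, 0, 0, 0)
    return (
        b"\x89PNG\r\n\x1a\n"
        + _chunk(b"IHDR", ihdr)
        + _chunk(b"IDAT", zlib.compress(bytes(raw)))
        + _chunk(b"IEND", b"")
    )
```

Writing the result with `open("mask.png", "wb").write(make_mask_png(w, h, box))`, where `w` and `h` are the source image's dimensions, satisfies the size-match requirement.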
All commands return structured JSON:
{
"status": "success",
"message": "Generated 1 image(s)",
"model": "gpt-image-2",
"images": [
{"index": 0, "path": "/absolute/path/to/gen_20260304_143022.png"}
]
}
PNG output files include embedded metadata (tEXt chunks) with the prompt, model, quality, and size used to generate them. View with identify -verbose file.png (ImageMagick) or any PNG metadata viewer.

The describe command returns text instead of images:
{
"status": "success",
"message": "Described 1 image",
"model": "gpt-4o-mini",
"mode": "alt-text",
"result": {
"file": "/path/to/photo.jpg",
"description": "Pixel art spaceship with blue cockpit and orange thrusters on white background"
}
}
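Because both shapes are plain JSON, an agent can parse stdout directly. The following is a small sketch based only on the two sample payloads above; the helper names are hypothetical, and treating any status other than "success" as an error is an assumption, since the failure payload is not shown here.

```python
import json


def _load_success(stdout_text: str) -> dict:
    """Parse omni-image stdout and raise if the status is not 'success'.

    Assumption: failures carry a different status and (possibly) a message.
    """
    payload = json.loads(stdout_text)
    if payload.get("status") != "success":
        raise RuntimeError(payload.get("message", "omni-image reported a failure"))
    return payload


def image_paths(stdout_text: str) -> list[str]:
    """Return the output paths from a generate/edit style response."""
    return [item["path"] for item in _load_success(stdout_text).get("images", [])]


def description(stdout_text: str) -> str:
    """Return the description text from a describe response."""
    return _load_success(stdout_text)["result"]["description"]
```

With the generate sample above, `image_paths` yields a one-element list containing the absolute path of the PNG.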
- source ~/.secrets or export XAI_API_KEY='xai-...'. This is the default provider.
- describe command and --provider openai. Run source ~/.secrets or export OPENAI_API_KEY='sk-...'.
- pip install openai. Both providers use this SDK.
- --retries 3 to automatically retry with exponential backoff (1s, 2s, 4s delays). Retry status is logged to stderr so you can monitor progress. For batch jobs, always use --retries 3.
- WriteError, the API returned empty data. Retry the command or check your API quota.

OpenAI also offers a Responses API that wraps image generation as a tool inside a conversational model call. This skill's script uses the Image API (direct generation/edit), which is simpler and more predictable for agent workflows.
The Responses API adds:
- action parameter: auto (let model decide), generate (force new image), edit (force editing an image in context)
- chatgpt-image-latest model alias: Tracks the latest image generation model for conversational use

When to use which:
The Responses API requires using the OpenAI SDK directly (not this script). Example:
from openai import OpenAI
client = OpenAI()
response = client.responses.create(
model="gpt-5",
input="Generate an image of a modern dashboard with dark theme",
tools=[{"type": "image_generation", "action": "generate"}],
)
This skill does not wrap the Responses API. Use it directly when multi-turn editing is needed.
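When calling the Responses API directly, the generated image arrives inside the response's output items rather than as a file on disk. The sketch below assumes the items expose a `type` of `"image_generation_call"` and a base64-encoded `result` field, as described in OpenAI's image generation tool guide; check this against the SDK version in use. The function name is a hypothetical helper.

```python
import base64


def decode_generated_images(output_items) -> list[bytes]:
    """Collect decoded image bytes from Responses API output items.

    Assumption: image results are items whose `type` is
    "image_generation_call" and whose `result` is a base64 string.
    Items of other types (text, reasoning, etc.) are ignored.
    """
    images = []
    for item in output_items:
        if getattr(item, "type", None) == "image_generation_call" and getattr(item, "result", None):
            images.append(base64.b64decode(item.result))
    return images
```

A typical call would pass `response.output` from the example above and write each returned bytes object to a .png file.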
Documented from a 214-image session testing formats, styles, content limits, and couple photo edits across Grok and OpenAI. These are empirical findings, not theoretical.
Formats where text is a physical object in the scene render more accurately than overlaid typography:
| Tier | Formats | Why |
|------|---------|-----|
| S tier | Diner receipt, boarding pass, fortune cookie, movie marquee letter tiles | Text IS the object. Monospace/tile formats are structurally constrained. |
| A tier | Neon sign, love note (handwritten), rolling paper note, door hanger | Text rendered as tubes, ink, or printed label. Physical anchor. |
| B tier | Boxing fight card, tasting menu, concert poster | Bold hierarchy works but long words garble in smaller tiers. |
| C tier | Festival lineup, app mockups (Spotify/Amazon/Yelp) | Multiple text blocks competing. Occasional substitution errors. |
When editing a real photo of a couple into a new scene:
Body description is critical. AI normalizes all bodies to "average fit." If someone is a bodybuilder, you MUST specify "extremely muscular bodybuilder physique, massive defined pecs, huge arms with visible veins, thick shoulders and traps" or the output will flatten them. Similarly, describe her as "beautiful, gorgeous curves, radiant, confident, glowing skin" -- the AI will make her look great but needs the direction.
Composition direction matters. Specify who is the focus:
Two modes for face handling in edits:
Use photorealistic preservation for:
Use stylized likeness for:
Styles with face limitations:
Edit endpoint content boundaries (Grok):
Grok renders Spanish text as reliably as English. Tested extensively:
| Word | Garbles To | Fix | |------|-----------|-----| | "Blowjob" | "Blaybob", "Blowbob", "Blayjob" | Use "HEAD", "BJ", or ALL CAPS in large fonts | | "Blowjob" in ALL CAPS large font | Renders clean | Preferred approach | | All other explicit English terms | Render clean | No workaround needed | | All Spanish explicit terms | Render clean | No issues found |
Don't repeat the same reference every time. Tested alternatives that all render cleanly:
| Reference | Spanish | Tone | |-----------|---------|------| | "Eat you out" | "Te como toda" | Direct | | "Fuck you until you can't walk" | "Te cojo hasta que no puedas caminar" | Dominant | | "Make you mine" | "Te hago mia" | Romantic-dominant | | "You won't be sleeping" | "No vas a dormir" | Implied | | "You already know" | "Lo que tu ya sabes" / "Ella ya sabe" | Winking | | "Don't make plans tomorrow" | "No hagas planes manana" | Implied consequence | | "I'm your dessert" | "Soy tu postre" | Playful | | "Don't wear underwear" | "No te pongas pantaletas" | Command | | "She always wins" | "Ella siempre gana" | Her-focused | | "..." or just the vibe | No text needed | Let the image speak | | "Walls will shake" | "Tiemblen las paredes" | Hyperbolic |
For sessions generating 50+ images:
- --dry-run before large batches
- --retries 2 on batches -- xAI has ~10% transient failure rate
- winner_v2_noir.png not gen_20260409_123456.png
- subprocess.run(["omni-image", "generate", p["prompt"], ...]). See examples/python_wrapper_example.py for a real-world script using the 5-part SCENE / STYLE / MOOD / LIGHTING / CAMERA formula.

At $0.07/image (Grok Pro), large sessions are cheap:
A single stock photo license costs more than a 50-image generation session. Iterate freely.
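The session arithmetic is simple enough to check before starting a large run. This sketch uses the two per-image prices quoted in this document ($0.02 standard, $0.07 Grok Pro); the tier labels and function name are hypothetical, and actual pricing should be confirmed with the provider.

```python
from decimal import Decimal

# Per-image prices as quoted in this document; tier labels are illustrative.
PRICE_PER_IMAGE = {
    "grok": Decimal("0.02"),
    "grok_pro": Decimal("0.07"),
}


def session_cost(images: int, tier: str = "grok_pro") -> Decimal:
    """Estimated cost in USD for generating `images` images at a given tier."""
    return PRICE_PER_IMAGE[tier] * images


fifty_pro = session_cost(50)        # 50 images at $0.07 each
full_session = session_cost(214)    # the 214-image test session at $0.07 each
```

Decimal is used instead of float so that totals such as 50 x 0.07 come out as exact dollar amounts.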