detecting-tips-zones/SKILL.md
Text-prompted image zone detection using TIPSv2 B/14 on CPU. Produces `focus_targets` / `focus_edges` bbox lists from natural-language labels, ready to feed into `svg-portrait-mode`. Use when you want automatic foreground/background separation from prompts like "dog face" + "wooden floor" instead of hand-annotating bboxes.
Zero-shot zone detection: text prompts → patch-grid cosine heatmaps → bboxes.
Companion to svg-portrait-mode — replaces manual focus_targets / focus_edges
annotation with a TIPSv2 B/14 forward pass.

    from tips_zones import detect_zones
    from portrait_mode import portrait_mode

    focus_targets, focus_edges = detect_zones(
        "photo.jpg",
        targets=["dog face"],
        edges=["dog paws", "dog ears", "dog body"],
        distractors=["wooden floor", "carpet rug", "shoes", "wall"],
        ckpt_dir="/path/to/tips/checkpoints",
        tips_root="/path/to/tips",
    )

    svg, stats = portrait_mode(
        "photo.jpg",
        focus_targets=focus_targets,
        focus_edges=focus_edges,
        style_transforms={"background": "desaturate:0.7"},
    )

Amortise model load across multiple images:

    from tips_zones import load_models, detect_zones

    models = load_models(ckpt_dir, tips_root, device="cpu")
    for img in images:
        ft, fe = detect_zones(img, targets=[...], edges=[...], distractors=[...],
                              ckpt_dir=ckpt_dir, tips_root=tips_root, models=models)
        ...


    image       → B/14 vision encoder (MaskCLIP values trick on last block)
                → (32×32 patch grid at 448, or 64×64 at 896) × 768-d patch features
    text labels → prompt ensemble (9 TCL templates) → B/14 text encoder
                → per-label mean feature → L2-normalise
    per-label heatmap = cos(patch feature, label feature)   # raw, no softmax
    bbox = top-k% patches → largest connected component → scaled + padded to image coords

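The steps above can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the skill's actual code: the function names (`label_feature`, `cosine_heatmap`, `largest_component`, `heatmap_to_bbox`) are hypothetical, the rounding and padding rules are guesses, and a plain flood fill stands in for whatever connected-component routine the skill really uses (scipy is in its dependency list, so `scipy.ndimage.label` is a plausible candidate).

```python
from collections import deque

import numpy as np


def l2_normalise(x, axis=-1):
    """Scale vectors to unit length along `axis`."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)


def label_feature(template_feats):
    """Mean over the prompt-ensemble axis, then L2-normalise.

    template_feats: (n_templates, dim) text features for one label.
    """
    return l2_normalise(template_feats.mean(axis=0))


def cosine_heatmap(patch_feats, label_feat):
    """Raw cosine between every patch and one label (no softmax).

    patch_feats: (grid_h, grid_w, dim); label_feat: (dim,), unit length.
    """
    return l2_normalise(patch_feats) @ label_feat


def largest_component(mask):
    """Largest 4-connected region of a boolean grid, as a boolean grid."""
    rows, cols = mask.shape
    seen = np.zeros_like(mask, dtype=bool)
    best = []
    for r0, c0 in zip(*np.nonzero(mask)):
        if seen[r0, c0]:
            continue
        seen[r0, c0] = True
        queue, region = deque([(r0, c0)]), []
        while queue:  # breadth-first flood fill from (r0, c0)
            r, c = queue.popleft()
            region.append((r, c))
            for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
                if (0 <= nr < rows and 0 <= nc < cols
                        and mask[nr, nc] and not seen[nr, nc]):
                    seen[nr, nc] = True
                    queue.append((nr, nc))
        if len(region) > len(best):
            best = region
    out = np.zeros_like(mask, dtype=bool)
    for r, c in best:
        out[r, c] = True
    return out


def heatmap_to_bbox(heatmap, image_size, top_frac=0.04, pad_frac=0.02):
    """Top-k% patches -> largest component -> padded (x1, y1, x2, y2) in pixels."""
    width, height = image_size
    grid_h, grid_w = heatmap.shape
    # Assumed rounding rule: keep at least one patch.
    k = max(1, int(round(top_frac * heatmap.size)))
    threshold = np.sort(heatmap, axis=None)[-k]
    rows, cols = np.nonzero(largest_component(heatmap >= threshold))
    # Patch indices -> pixel coordinates of the patch edges, then pad.
    x1 = cols.min() * width / grid_w - pad_frac * width
    x2 = (cols.max() + 1) * width / grid_w + pad_frac * width
    y1 = rows.min() * height / grid_h - pad_frac * height
    y2 = (rows.max() + 1) * height / grid_h + pad_frac * height
    return (max(0, int(x1)), max(0, int(y1)),
            min(width, int(round(x2))), min(height, int(round(y2))))
```

With a 32×32 heatmap over a 640×320 image, each patch spans 20×10 pixels, so a component covering columns 16 to 19 and rows 8 to 11 maps to (320, 80, 400, 120) before padding.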
Naïve softmax assumes labels are mutually exclusive. But "dog face",
"dog ears", and "dog body" are all true of the same pixels, so softmax
collapses to near-uniform and every heatmap covers the whole subject. Raw
cosines with a per-label top-k threshold work much better, at the cost of
requiring distractor labels to anchor the relative scale. Always pass some
distractors (floor, wall, props: whatever is in the scene but not the
subject).
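A toy illustration of the collapse described above. The cosine values are made up for the example (they are not measurements from TIPSv2), and `softmax` / `best_patch` are hypothetical helpers; the point is only that when several labels score almost equally on the same patches, a softmax across labels hands each of them nearly the same probability everywhere on the subject.

```python
import math

labels = ["dog face", "dog ears", "dog body", "wooden floor"]

# Hypothetical raw cosines: one row per patch, one column per label.
cosines = [
    [0.31, 0.29, 0.30, 0.10],  # patch on the face
    [0.29, 0.31, 0.30, 0.10],  # patch on an ear
    [0.28, 0.28, 0.31, 0.11],  # patch on the body
    [0.09, 0.10, 0.10, 0.30],  # patch on the floor
]


def softmax(row):
    """Numerically stable softmax over one patch's label scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def best_patch(scores, label_idx):
    """Index of the patch with the highest score for one label."""
    return max(range(len(scores)), key=lambda i: scores[i][label_idx])


probs = [softmax(row) for row in cosines]

# Softmax view: the three dog labels on the three subject patches are
# almost indistinguishable from one another.
subject = [p for row in probs[:3] for p in row[:3]]
spread = max(subject) - min(subject)

# Raw-cosine view: thresholding each label column on its own still finds
# a distinct best patch per label.
raw_winners = [best_patch(cosines, j) for j in range(len(labels))]

print(f"softmax spread across dog labels on subject patches: {spread:.3f}")
print("best patch per label from raw cosines:", raw_winners)
```

The distractor column matters in the raw-cosine view: it gives the floor patch a label that beats every subject label there, which is the "anchor" role the text assigns to distractors.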

    detect_zones(
        image,                 # path | PIL Image
        targets,               # ["main subject label", ...]
        edges=(),              # ["sub-region label", ...]
        distractors=(),        # scene elements to anchor against — pass these!
        *,
        ckpt_dir,              # has tips_v2_oss_b14_{vision,text}.npz + tokenizer.model
        tips_root,             # local clone of google-deepmind/tips
        input_size=448,        # 448 → 32×32 grid, 896 → 64×64 (~12× slower on CPU)
        target_top_frac=0.04,  # fraction of patches kept per target label
        edge_top_frac=0.06,    # fraction of patches kept per edge label
        pad_frac=0.02,         # bbox padding as fraction of image dim
        device="cpu",
        models=None,           # optional pre-loaded (img_model, text_model, tokenizer)
    )

Returns (focus_targets, focus_edges) — both lists of {'bbox': (x1,y1,x2,y2), 'label': str}.
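To get a feel for what the defaults mean in patches and pixels, the arithmetic can be sketched as below. The helper names are hypothetical and the rounding is an assumption (the skill may round differently); only the patch size of 14, the grid sizes, and the default fractions come from the signature above.

```python
PATCH_SIZE = 14  # the "/14" in B/14


def grid_side(input_size, patch_size=PATCH_SIZE):
    """Number of patches along one side of the square input."""
    return input_size // patch_size


def patches_kept(input_size, top_frac, patch_size=PATCH_SIZE):
    """Patches surviving the top-k% threshold (assumed: round, minimum 1)."""
    side = grid_side(input_size, patch_size)
    return max(1, round(top_frac * side * side))


def pad_pixels(image_dim, pad_frac=0.02):
    """Padding added on each side of a bbox along one image dimension."""
    return round(pad_frac * image_dim)


for size in (448, 896):
    side = grid_side(size)
    print(f"input_size={size}: {side}x{side} grid, "
          f"{patches_kept(size, 0.04)} target patches, "
          f"{patches_kept(size, 0.06)} edge patches")
```

At 448 the default `target_top_frac=0.04` keeps about 41 of 1024 patches, and `pad_frac=0.02` on a 4000-pixel-wide photo adds about 80 pixels per side.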
| Step | Time |
|------|------|
| load_models (warm) | ~3.5s |
| load_models (cold, over 9p) | ~50s |
| Text encoding (9 templates × N labels) | ~0.1s |
| Vision forward @ 448 | 0.3–0.6s |
| Vision forward @ 896 | ~6–7s |
Inference is negligible next to portrait_mode() on large images.
Subject / background split: strong. B/14 separates subject from scene reliably — typical split ~30/70 subject:background on single-subject photos.
Sub-part discrimination: weak at B/14 + 448. "dog face" vs "dog paws" vs "dog ears" tend to fire on the same region. The 32×32 patch grid is not the bottleneck (64×64 at 896 barely helps); B/14's patch features just don't encode fine sub-part semantics strongly. If you need per-part zones:
For coarse target/edge zoning (the portrait_mode use case), B/14 at 448 is
enough.
Python deps:

    pip install torch torchvision tensorflow tensorflow-text scipy pillow numpy --break-system-packages -q

Upstream TIPS repo (for the tips.pytorch image/text encoder modules):

    git clone https://github.com/google-deepmind/tips /path/to/tips

B/14 checkpoints (~500 MB total) go in a directory passed as ckpt_dir:

- tips_v2_oss_b14_vision.npz
- tips_v2_oss_b14_text.npz
- tokenizer.model

Download links are in the TIPS repo README.
- Bbox too small: raise target_top_frac (0.04 → 0.08). Too big / bleeds into
  scene: lower it.
- pad_frac=0.02 works for most photos; raise to 0.05 for subjects near frame
  edges.

portrait_mode (via OpenCV) honours EXIF rotation. PIL (this skill's
preprocessing) does not. For correctly-oriented source images they agree; for
EXIF-rotated phone photos the detected bboxes will be in the raw pixel
orientation. Either:

- re-save the file with Image.open(p).rotate(0, expand=True).save(p), or
- apply ImageOps.exif_transpose(pil) before passing to detect_zones.
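A minimal sketch of the `exif_transpose` route, assuming Pillow 6.0 or later. `load_upright` is a hypothetical helper, not part of this skill; it relies only on `detect_zones` accepting a PIL Image, as the signature states.

```python
from PIL import Image, ImageOps


def load_upright(source):
    """Open an image and bake its EXIF orientation into the pixels.

    `source` is anything Image.open accepts (a path or a binary file object).
    """
    with Image.open(source) as img:
        # By default exif_transpose returns a new image with the
        # orientation applied, so it stays valid after the file closes.
        return ImageOps.exif_transpose(img)


# Intended use (not executed here, since it needs the skill and checkpoints):
# pil = load_upright("photo.jpg")
# focus_targets, focus_edges = detect_zones(pil, targets=[...], ...)
```

Because the pixels are upright before detection, the returned bboxes should line up with the orientation that portrait_mode sees when it reads the same file.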