Computer Vision Helper

The Computer Vision Helper skill guides you through implementing image analysis and visual AI tasks. From basic image classification to complex object detection and segmentation, this skill helps you leverage modern computer vision techniques effectively.

Computer vision has been transformed by deep learning and now by vision-language models. This skill covers both traditional approaches (CNNs, pre-trained models) and cutting-edge techniques (CLIP, GPT-4V, Segment Anything). It helps you choose the right approach based on your accuracy requirements, available data, and deployment constraints.

Whether you are building product recognition, document analysis, medical imaging, or any visual AI application, this skill ensures you understand the landscape and implement solutions that work.

Core Workflows

Workflow 1: Select Computer Vision Approach

Define the task:
- Classification: What category is this image?
- Detection: Where are objects in this image?
- Segmentation: Pixel-level object boundaries
- OCR: Extract text from images
- Similarity: Find similar images
- Generation: Create or modify images
Assess available resources:
- Training data quantity and quality
- Compute budget (training and inference)
- Latency requirements
- Accuracy needs
Choose approach: | Task | No Training Data | Small Dataset | Large Dataset | |------|-----------------|---------------|---------------| | Classification | CLIP, GPT-4V | Transfer learning | Fine-tune/train | | Detection | GPT-4V, Grounding DINO | Fine-tune YOLO | Train custom | | Segmentation | SAM | Fine-tune SAM | Train custom | | OCR | Cloud APIs, Tesseract | Fine-tune | Train custom |
Plan implementation
Document approach rationale

Workflow 2: Implement Image Classification

Prepare data:

# Data loading with augmentation
transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                       std=[0.229, 0.224, 0.225])
])

dataset = ImageFolder(root='data/', transform=transform)
dataloader = DataLoader(dataset, batch_size=32, shuffle=True)

Set up model:

# Transfer learning from pretrained model
model = models.resnet50(pretrained=True)

# Freeze early layers
for param in model.parameters():
    param.requires_grad = False

# Replace classifier head
model.fc = nn.Linear(model.fc.in_features, num_classes)

Train with validation
Evaluate on test set
Optimize for deployment

Workflow 3: Deploy Vision Model

Optimize model:
- Quantization (INT8)
- Pruning
- ONNX export
- TensorRT optimization

Set up inference pipeline:

class VisionPipeline:
    def __init__(self, model_path):
        self.model = load_optimized_model(model_path)
        self.preprocessor = ImagePreprocessor()

    def predict(self, image):
        # Preprocess
        tensor = self.preprocessor.process(image)

        # Inference
        with torch.no_grad():
            output = self.model(tensor)

        # Postprocess
        return self.postprocess(output)

    def predict_batch(self, images):
        tensors = [self.preprocessor.process(img) for img in images]
        batch = torch.stack(tensors)

        with torch.no_grad():
            outputs = self.model(batch)

        return [self.postprocess(out) for out in outputs]

Deploy to target environment
Monitor performance

Quick Reference

| Action | Command/Trigger | |--------|-----------------| | Choose approach | "What CV approach for [task]" | | Classify images | "Build image classifier" | | Detect objects | "Object detection for [use case]" | | Extract text | "OCR from images" | | Zero-shot vision | "Classify images without training data" | | Optimize model | "Speed up vision model" |

Best Practices

Start with Pre-trained: Don't train from scratch unless necessary
- ImageNet pre-trained models for general vision
- Domain-specific models when available
- CLIP/GPT-4V for zero-shot capabilities
Data Quality Over Quantity: Clean, balanced data matters
- Remove mislabeled and duplicate images
- Balance classes or use weighted training
- Include edge cases in test set
Augment Thoughtfully: Augmentation should reflect real variation
- Use augmentations that mirror production conditions
- Don't augment in ways that destroy task-relevant features
- Test that augmentation helps, don't assume
Validate Correctly: Image data leaks easily
- Split by unique images, not by augmented versions
- Consider subject-level splits (same person in different photos)
- Test on truly held-out data
Optimize for Target Hardware: Inference matters
- Know your deployment constraints (edge vs cloud)
- Profile and optimize bottlenecks
- Consider batch size for throughput
Handle Edge Cases: Real images are messy
- Different lighting conditions
- Rotation, blur, occlusion
- Unusual aspect ratios
- Out-of-distribution inputs

Advanced Techniques

Vision-Language Models for Zero-Shot

Use CLIP for classification without training:

import clip

model, preprocess = clip.load("ViT-B/32")

def zero_shot_classify(image, labels):
    # Prepare image
    image_tensor = preprocess(image).unsqueeze(0)

    # Prepare text prompts
    text_prompts = [f"a photo of a {label}" for label in labels]
    text_tokens = clip.tokenize(text_prompts)

    # Get embeddings
    with torch.no_grad():
        image_features = model.encode_image(image_tensor)
        text_features = model.encode_text(text_tokens)

    # Compute similarities
    similarities = (image_features @ text_features.T).softmax(dim=-1)

    return {label: sim.item() for label, sim in zip(labels, similarities[0])}

GPT-4V for Visual Analysis

Use multimodal LLMs for complex vision tasks:

def analyze_image(image_path, question):
    import base64
    from openai import OpenAI

    # Encode image
    with open(image_path, "rb") as f:
        image_data = base64.b64encode(f.read()).decode()

    client = OpenAI()
    response = client.chat.completions.create(
        model="gpt-4-vision-preview",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                {"type": "image_url", "image_url": {
                    "url": f"data:image/jpeg;base64,{image_data}"
                }}
            ]
        }],
        max_tokens=500
    )

    return response.choices[0].message.content

Object Detection with YOLO

Fast, accurate object detection:

from ultralytics import YOLO

# Load pretrained model
model = YOLO("yolov8n.pt")

# Fine-tune on custom dataset
model.train(
    data="custom_dataset.yaml",
    epochs=100,
    imgsz=640,
    batch=16
)

# Inference
results = model.predict(source="image.jpg", conf=0.5)

for result in results:
    boxes = result.boxes
    for box in boxes:
        xyxy = box.xyxy[0].tolist()  # Bounding box
        conf = box.conf[0].item()     # Confidence
        cls = box.cls[0].item()       # Class ID
        print(f"Detected {cls} at {xyxy} with confidence {conf}")

Segment Anything (SAM)

Universal segmentation:

from segment_anything import sam_model_registry, SamPredictor

# Load SAM
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h.pth")
predictor = SamPredictor(sam)

# Set image
predictor.set_image(image)

# Segment with point prompt
masks, scores, logits = predictor.predict(
    point_coords=np.array([[500, 375]]),  # Click point
    point_labels=np.array([1]),            # 1 = foreground
    multimask_output=True
)

# Segment with box prompt
masks, scores, logits = predictor.predict(
    box=np.array([x1, y1, x2, y2])
)

Model Optimization Pipeline

Prepare models for production:

def optimize_for_deployment(model, sample_input):
    # Step 1: Export to ONNX
    torch.onnx.export(
        model,
        sample_input,
        "model.onnx",
        opset_version=13,
        dynamic_axes={"input": {0: "batch"}}
    )

    # Step 2: Quantize (INT8)
    from onnxruntime.quantization import quantize_dynamic
    quantize_dynamic(
        "model.onnx",
        "model_quantized.onnx",
        weight_type=QuantType.QInt8
    )

    # Step 3: Benchmark
    import onnxruntime as ort
    session = ort.InferenceSession("model_quantized.onnx")
    benchmark_inference(session, sample_input)

    return "model_quantized.onnx"

Common Pitfalls to Avoid

Training from scratch when transfer learning would work
Not augmenting data appropriately for the task
Data leakage through improper train/test splits
Ignoring class imbalance in training data
Overfitting to training data without regularization
Not testing on diverse, real-world images
Deploying without latency and throughput testing
Assuming models work on all image types without testing

Computer Vision Helper

Whether you are building product recognition, document analysis, medical imaging, or any visual AI application, this skill ensures you understand the landscape and implement solutions that work.

Core Workflows

Workflow 1: Select Computer Vision Approach

Define the task:
- Classification: What category is this image?
- Detection: Where are objects in this image?
- Segmentation: Pixel-level object boundaries
- OCR: Extract text from images
- Similarity: Find similar images
- Generation: Create or modify images
Assess available resources:
- Training data quantity and quality
- Compute budget (training and inference)
- Latency requirements
- Accuracy needs
Choose approach: | Task | No Training Data | Small Dataset | Large Dataset | |------|-----------------|---------------|---------------| | Classification | CLIP, GPT-4V | Transfer learning | Fine-tune/train | | Detection | GPT-4V, Grounding DINO | Fine-tune YOLO | Train custom | | Segmentation | SAM | Fine-tune SAM | Train custom | | OCR | Cloud APIs, Tesseract | Fine-tune | Train custom |
Plan implementation
Document approach rationale

Workflow 2: Implement Image Classification

Prepare data:

# Data loading with augmentation
transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                       std=[0.229, 0.224, 0.225])
])

dataset = ImageFolder(root='data/', transform=transform)
dataloader = DataLoader(dataset, batch_size=32, shuffle=True)

Set up model:

# Transfer learning from pretrained model
model = models.resnet50(pretrained=True)

# Freeze early layers
for param in model.parameters():
    param.requires_grad = False

# Replace classifier head
model.fc = nn.Linear(model.fc.in_features, num_classes)

Train with validation
Evaluate on test set
Optimize for deployment

Workflow 3: Deploy Vision Model

Optimize model:
- Quantization (INT8)
- Pruning
- ONNX export
- TensorRT optimization

Set up inference pipeline:

class VisionPipeline:
    def __init__(self, model_path):
        self.model = load_optimized_model(model_path)
        self.preprocessor = ImagePreprocessor()

    def predict(self, image):
        # Preprocess
        tensor = self.preprocessor.process(image)

        # Inference
        with torch.no_grad():
            output = self.model(tensor)

        # Postprocess
        return self.postprocess(output)

    def predict_batch(self, images):
        tensors = [self.preprocessor.process(img) for img in images]
        batch = torch.stack(tensors)

        with torch.no_grad():
            outputs = self.model(batch)

        return [self.postprocess(out) for out in outputs]

Deploy to target environment
Monitor performance

Quick Reference

Best Practices

Start with Pre-trained: Don't train from scratch unless necessary
- ImageNet pre-trained models for general vision
- Domain-specific models when available
- CLIP/GPT-4V for zero-shot capabilities
Data Quality Over Quantity: Clean, balanced data matters
- Remove mislabeled and duplicate images
- Balance classes or use weighted training
- Include edge cases in test set
Augment Thoughtfully: Augmentation should reflect real variation
- Use augmentations that mirror production conditions
- Don't augment in ways that destroy task-relevant features
- Test that augmentation helps, don't assume
Validate Correctly: Image data leaks easily
- Split by unique images, not by augmented versions
- Consider subject-level splits (same person in different photos)
- Test on truly held-out data
Optimize for Target Hardware: Inference matters
- Know your deployment constraints (edge vs cloud)
- Profile and optimize bottlenecks
- Consider batch size for throughput
Handle Edge Cases: Real images are messy
- Different lighting conditions
- Rotation, blur, occlusion
- Unusual aspect ratios
- Out-of-distribution inputs

Advanced Techniques

Vision-Language Models for Zero-Shot

Use CLIP for classification without training:

import clip

model, preprocess = clip.load("ViT-B/32")

def zero_shot_classify(image, labels):
    # Prepare image
    image_tensor = preprocess(image).unsqueeze(0)

    # Prepare text prompts
    text_prompts = [f"a photo of a {label}" for label in labels]
    text_tokens = clip.tokenize(text_prompts)

    # Get embeddings
    with torch.no_grad():
        image_features = model.encode_image(image_tensor)
        text_features = model.encode_text(text_tokens)

    # Compute similarities
    similarities = (image_features @ text_features.T).softmax(dim=-1)

    return {label: sim.item() for label, sim in zip(labels, similarities[0])}

GPT-4V for Visual Analysis

Use multimodal LLMs for complex vision tasks:

def analyze_image(image_path, question):
    import base64
    from openai import OpenAI

    # Encode image
    with open(image_path, "rb") as f:
        image_data = base64.b64encode(f.read()).decode()

    client = OpenAI()
    response = client.chat.completions.create(
        model="gpt-4-vision-preview",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                {"type": "image_url", "image_url": {
                    "url": f"data:image/jpeg;base64,{image_data}"
                }}
            ]
        }],
        max_tokens=500
    )

    return response.choices[0].message.content

Object Detection with YOLO

Fast, accurate object detection:

from ultralytics import YOLO

# Load pretrained model
model = YOLO("yolov8n.pt")

# Fine-tune on custom dataset
model.train(
    data="custom_dataset.yaml",
    epochs=100,
    imgsz=640,
    batch=16
)

# Inference
results = model.predict(source="image.jpg", conf=0.5)

for result in results:
    boxes = result.boxes
    for box in boxes:
        xyxy = box.xyxy[0].tolist()  # Bounding box
        conf = box.conf[0].item()     # Confidence
        cls = box.cls[0].item()       # Class ID
        print(f"Detected {cls} at {xyxy} with confidence {conf}")

Segment Anything (SAM)

Universal segmentation:

from segment_anything import sam_model_registry, SamPredictor

# Load SAM
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h.pth")
predictor = SamPredictor(sam)

# Set image
predictor.set_image(image)

# Segment with point prompt
masks, scores, logits = predictor.predict(
    point_coords=np.array([[500, 375]]),  # Click point
    point_labels=np.array([1]),            # 1 = foreground
    multimask_output=True
)

# Segment with box prompt
masks, scores, logits = predictor.predict(
    box=np.array([x1, y1, x2, y2])
)

Model Optimization Pipeline

Prepare models for production:

def optimize_for_deployment(model, sample_input):
    # Step 1: Export to ONNX
    torch.onnx.export(
        model,
        sample_input,
        "model.onnx",
        opset_version=13,
        dynamic_axes={"input": {0: "batch"}}
    )

    # Step 2: Quantize (INT8)
    from onnxruntime.quantization import quantize_dynamic
    quantize_dynamic(
        "model.onnx",
        "model_quantized.onnx",
        weight_type=QuantType.QInt8
    )

    # Step 3: Benchmark
    import onnxruntime as ort
    session = ort.InferenceSession("model_quantized.onnx")
    benchmark_inference(session, sample_input)

    return "model_quantized.onnx"

Common Pitfalls to Avoid

Training from scratch when transfer learning would work
Not augmenting data appropriately for the task
Data leakage through improper train/test splits
Ignoring class imbalance in training data
Overfitting to training data without regularization
Not testing on diverse, real-world images
Deploying without latency and throughput testing
Assuming models work on all image types without testing

Adoption

jmsktm/Computer Vision Helper

$ install --global

Security Scan Results

SKILL.md

Computer Vision Helper

Core Workflows

Workflow 1: Select Computer Vision Approach

Workflow 2: Implement Image Classification

Workflow 3: Deploy Vision Model

Quick Reference

Best Practices

Advanced Techniques

Vision-Language Models for Zero-Shot

GPT-4V for Visual Analysis

Object Detection with YOLO

Segment Anything (SAM)

Model Optimization Pipeline

Common Pitfalls to Avoid

Related Skills

jmsktm/YouTube Optimizer

jmsktm/Workshop Facilitator

jmsktm/Workflow Designer

jmsktm/Workflow Automator

jmsktm/Computer Vision Helper

$ install --global

Security Scan Results

SKILL.md

Computer Vision Helper

Core Workflows

Workflow 1: Select Computer Vision Approach

Workflow 2: Implement Image Classification

Workflow 3: Deploy Vision Model

Quick Reference

Best Practices

Advanced Techniques

Vision-Language Models for Zero-Shot

GPT-4V for Visual Analysis

Object Detection with YOLO

Segment Anything (SAM)

Model Optimization Pipeline

Common Pitfalls to Avoid

Related Skills

jmsktm/YouTube Optimizer

jmsktm/Workshop Facilitator

jmsktm/Workflow Designer

jmsktm/Workflow Automator