skills/codeocr-effectiveness-vision-code/SKILL.md
Render source code as images for vision LLM processing to reduce token cost while preserving understanding. Use when: 'render code as image for LLM', 'compress code tokens with images', 'use vision model for code understanding', 'reduce token cost for large codebase analysis', 'code image compression for clone detection', 'syntax highlighted code screenshot for VLM'.
npx skillsauth add ndpvt-web/arxiv-claude-skills codeocr-effectiveness-vision-code
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
This skill enables Claude to guide users in converting source code into rendered images and feeding them to Vision Language Models (VLMs) instead of raw text tokens. Based on the CodeOCR research, this approach achieves up to 8x token compression by leveraging the inherent compressibility of image representations -- adjusting image resolution reduces token consumption while preserving enough visual structure for code understanding tasks. The technique is particularly effective for clone detection, code QA, and code completion when combined with syntax highlighting.
Optical compression replaces the text-based paradigm (code as token sequences) with an image-based one (code as rendered screenshots). Text tokens scale linearly with code length and are hard to compress without losing semantics. Images, by contrast, can be resized to lower resolutions -- the VLM's vision encoder tiles the image into fixed-size patches (typically 112px tiles, each costing ~170 tokens), so reducing resolution directly reduces token count. A 1024x1024 rendered code image might cost ~1100 tokens regardless of how many lines of code it contains, whereas the same code as text could consume 4000+ tokens.
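The tiling arithmetic described above can be sketched as follows. This is a minimal estimate that assumes the 112px tile size and ~170 tokens per tile quoted in the text; `estimate_image_tokens` is a hypothetical helper, not part of CodeOCR, and real vision encoders differ in tile size, per-tile cost, and fixed overhead.

```python
import math


def estimate_image_tokens(width: int, height: int,
                          tile_px: int = 112,
                          tokens_per_tile: int = 170) -> int:
    """Estimate vision tokens for an image split into fixed-size tiles.

    tile_px and tokens_per_tile are the figures quoted in the text; they are
    assumptions here and vary between models.
    """
    tiles_x = math.ceil(width / tile_px)   # partial tiles count as full tiles
    tiles_y = math.ceil(height / tile_px)
    return tiles_x * tiles_y * tokens_per_tile


# Halving each side quarters the tile count, and with it the token cost.
full = estimate_image_tokens(448, 448)   # 4 x 4 tiles
half = estimate_image_tokens(224, 224)   # 2 x 2 tiles
print(full, half, full / half)
```

The cost depends only on the pixel dimensions, which is why resizing the rendered image is the compression lever.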
The critical insight is that not all code tasks degrade equally under compression. Clone detection is highly resilient -- structural similarity is visually apparent even at low resolution, and some compression ratios slightly outperform raw text. Code completion benefits significantly from syntax highlighting (VS Code-style color themes), which provides visual cues that help VLMs distinguish keywords, strings, functions, and comments even when individual characters blur. Code summarization and QA degrade more, as they require fine-grained character-level reading. The practical recommendation: use 2-4x compression with syntax highlighting for most tasks, and reserve 1x (no compression) for tasks requiring exact token recovery.
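One way to encode this guidance is a small lookup. The sketch below is illustrative only: the task names, the per-task values, and the `recommend_ratio` helper are assumptions chosen to fall inside the 2-4x range recommended above, not settings taken from CodeOCR.

```python
# Illustrative mapping only. Values follow the guidance in the text
# (2-4x for most tasks, 1x for exact recovery); the task names and the
# specific ratio picked for each task are assumptions.
RECOMMENDED_RATIO = {
    "clone_detection": 4.0,   # resilient to compression
    "code_completion": 4.0,   # pair with syntax highlighting
    "summarization": 2.0,     # needs finer character-level detail
    "qa": 2.0,
    "exact_recovery": 1.0,    # no compression
}


def recommend_ratio(task: str) -> float:
    """Return a starting compression ratio for a task.

    Unknown tasks fall back to 2.0, the conservative end of the 2-4x range.
    """
    return RECOMMENDED_RATIO.get(task, 2.0)


print(recommend_ratio("clone_detection"))
```

Treat the returned value as a starting point and validate it on your own data before relying on it.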
The rendering pipeline uses Pygments for syntax tokenization and Pillow for image generation. Code is drawn character-by-character onto an RGB canvas with configurable font size (default 32px), DPI (300), and dimensions (1024x1024). The "modern" theme mirrors VS Code Light Modern colors (magenta keywords, teal classes, green comments, red strings). Compression is applied by resizing with LANCZOS resampling to a target resolution calculated from the desired token budget.
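The token-colored, character-by-character layout step can be illustrated without any imaging dependencies. The sketch below swaps in Python's standard `tokenize` module for Pygments and only computes draw operations; it does not produce an image. The RGB values, the `char_w` and `line_h` spacing, and the helper names are assumptions for illustration, not CodeOCR's actual palette or layout logic.

```python
import io
import keyword
import tokenize

# Colors loosely follow the "modern" theme described above; the exact RGB
# values are illustrative assumptions.
COLORS = {
    "keyword": (175, 0, 219),   # magenta
    "string": (163, 21, 21),    # red
    "comment": (0, 128, 0),     # green
    "default": (0, 0, 0),       # black
}


def classify(tok: tokenize.TokenInfo) -> str:
    """Map a stdlib token to one of the color classes above."""
    if tok.type == tokenize.NAME and keyword.iskeyword(tok.string):
        return "keyword"
    if tok.type == tokenize.STRING:
        return "string"
    if tok.type == tokenize.COMMENT:
        return "comment"
    return "default"


def layout_characters(code: str, char_w: int = 19, line_h: int = 38):
    """Return (char, x, y, rgb) draw operations, one per visible character."""
    ops = []
    for tok in tokenize.generate_tokens(io.StringIO(code).readline):
        if tok.start[0] != tok.end[0]:
            continue  # sketch limitation: multi-line tokens are skipped
        color = COLORS[classify(tok)]
        row, col = tok.start  # row is 1-based, col is 0-based
        for offset, ch in enumerate(tok.string):
            if ch.isspace():
                continue  # whitespace needs no draw call
            ops.append((ch, (col + offset) * char_w, (row - 1) * line_h, color))
    return ops


sample = 'def f():\n    return "hi"  # note\n'
for op in layout_characters(sample)[:3]:
    print(op)
```

In a real renderer each operation would become one text draw call on the RGB canvas, with the finished image resized afterwards.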
Install CodeOCR from the repository:

    git clone https://github.com/YerbaPage/CodeOCR.git
    cd CodeOCR
    pip install -r requirements.txt

Requires Python >= 3.10, Pillow, Pygments, and tiktoken. A GPU is optional (only needed for embedding models in retrieval tasks).
Render code to images using the Python API with syntax highlighting enabled:

    from CodeOCR import render_code_to_images

    with open("target.py") as f:
        code = f.read()

    images = render_code_to_images(
        code,
        language="python",
        enable_syntax_highlight=True,
        theme="modern",       # VS Code Light Modern colors
        auto_optimize=True,   # auto-adjust font/layout for content
        width=1024,
        height=1024,
        font_size=32,
    )
    images[0].save("rendered_code.png")
Calculate the baseline token costs to understand your compression ratio:

    import math
    import tiktoken

    enc = tiktoken.encoding_for_model("gpt-4")
    text_tokens = len(enc.encode(code))

    # Image tokens: ceil(width/112) * ceil(height/112) * 170
    # For 1024x1024 this formula gives 10 * 10 * 170 = 17000.
    image_tokens = math.ceil(1024 / 112) * math.ceil(1024 / 112) * 170
Apply compression by resizing images to a target token budget:

    from CodeOCR import compress_images

    compressed = compress_images(
        images,
        text_tokens=text_tokens,
        compression_ratio=4.0,  # 4x fewer tokens than text
    )

The compressor calculates image_token_limit = text_tokens / ratio, finds the closest resolution from the tile grid, and resizes with LANCZOS.
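The budget-to-resolution step can be sketched as below. This is one possible reading of "closest resolution from the tile grid", not CodeOCR's actual algorithm: `target_size` is a hypothetical helper that assumes 112px tiles at ~170 tokens each and picks the largest tile-aligned size that keeps the original aspect ratio within the budget.

```python
def target_size(width: int, height: int, text_tokens: int,
                compression_ratio: float,
                tile_px: int = 112, tokens_per_tile: int = 170):
    """Pick a tile-aligned (width, height) whose estimated cost fits the budget.

    tile_px and tokens_per_tile are assumptions taken from the figures in
    this guide; they vary between models.
    """
    token_limit = text_tokens / compression_ratio
    # At least one tile is always used, even if that exceeds a tiny budget.
    max_tiles = max(1, int(token_limit // tokens_per_tile))
    aspect = width / height

    best = (1, 1)
    for rows in range(1, max_tiles + 1):
        cols = max(1, round(rows * aspect))  # keep roughly the same shape
        if cols * rows > max_tiles:
            break
        best = (cols, rows)

    cols, rows = best
    return cols * tile_px, rows * tile_px


# 4000 text tokens at 4x leaves a 1000-token budget, i.e. at most 5 tiles.
print(target_size(1024, 1024, text_tokens=4000, compression_ratio=4.0))
```

With Pillow, the resulting size would then be passed to `image.resize(size, Image.Resampling.LANCZOS)`.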
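Choose the compression ratio based on task type: clone detection tolerates aggressive compression (4x in the examples here), code completion holds up at 2-4x when syntax highlighting is enabled, summarization and QA degrade faster and are safer near 2x, and tasks requiring exact token recovery should stay at 1x.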
Send rendered images to the VLM with a task-appropriate prompt:

    from CodeOCR import create_client, call_llm_with_images

    client = create_client()  # reads OPENAI_API_KEY from env
    response, token_info = call_llm_with_images(
        client,
        model_name="gpt-4o",
        images=compressed,
        system_prompt="You are a code analysis assistant.",
        user_prompt="Are these two code snippets functionally equivalent?",
    )
    print(f"Response: {response}")
    print(f"Tokens used: {token_info}")
For batch processing, render and compress in a loop, tracking token savings:

    # Assumes enc, render_code_to_images, and compress_images from the steps above.
    # code_files is your list of source file paths; calculate_image_tokens is
    # assumed to be importable from CodeOCR.
    total_text_tokens = 0
    total_image_tokens = 0
    for filepath in code_files:
        with open(filepath) as f:
            code = f.read()
        text_toks = len(enc.encode(code))
        imgs = render_code_to_images(code, language="python", enable_syntax_highlight=True)
        compressed = compress_images(imgs, text_tokens=text_toks, compression_ratio=4.0)
        total_text_tokens += text_toks
        total_image_tokens += sum(calculate_image_tokens(img) for img in compressed)
    print(f"Compression: {total_text_tokens / total_image_tokens:.1f}x")
Use the CLI for quick experiments without writing code:

    # Render a file to an image
    python -m CodeOCR.demo render --file example.py -o output.png

    # Query a VLM about rendered code
    python -m CodeOCR.demo query --file example.py -i "Explain this function"

    # OCR: render then reconstruct text (test round-trip fidelity)
    python -m CodeOCR.demo ocr --file example.py
Run downstream task benchmarks to validate on standard datasets:

    cd downstream

    # Clone detection with image compression
    python -u run_pipeline.py --run-tasks --task code_clone_detection \
        --models gpt-4o --resize-mode --code-clone-detection-separate-mode

    # Code completion with syntax highlighting
    python -u run_pipeline.py --run-tasks --task code_completion_rag \
        --models gpt-4o --resize-mode --preserve-newlines --enable-syntax-highlight
Evaluate round-trip fidelity with code reconstruction (RQ5) to gauge how much information is lost:

    cd reconstruction
    python run.py  # renders -> OCR -> compares against original
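To get a feel for what "compares against original" can mean, here is a minimal sketch of a character-level similarity check using Python's standard `difflib`. The metric and the `roundtrip_similarity` helper are assumptions for illustration; the repository's own evaluation may use different measures.

```python
import difflib


def roundtrip_similarity(original: str, reconstructed: str) -> float:
    """Return a 0..1 character-level similarity between two code strings.

    1.0 means the reconstructed text is identical to the original.
    """
    matcher = difflib.SequenceMatcher(None, original, reconstructed, autojunk=False)
    return matcher.ratio()


original = "x = 10"
reconstructed = "x = 1O"  # a typical OCR confusion: digit 0 read as letter O
print(f"similarity: {roundtrip_similarity(original, reconstructed):.3f}")
```

A single misread character barely moves this score yet can change program behavior, which is why tasks needing exact token recovery are best left uncompressed.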
Example 1: Reducing token cost for bulk clone detection
User: "I have 500 pairs of code files to check for clones. The text tokens would cost too much with GPT-4o. Can I use images instead?"
Approach:
- Render both files in each pair with `render_code_to_images()`
- Compress with `compress_images()` at 4x -- clone detection tolerates this well

Output:

    File pair: auth_v1.py vs auth_v2.py
    Text tokens (both files): 3,200
    Image tokens (4x compressed): 780
    Result: "Yes, functionally equivalent. Both implement OAuth2 flow with
    identical logic; v2 renames variables and extracts a helper method."
    Savings: 4.1x token reduction
Example 2: Code completion with syntax highlighting boost
User: "I want to test whether sending code as a highlighted screenshot helps GPT-4o complete a function better than plain text at the same token budget."
Approach:
- Render the same code with `enable_syntax_highlight=True, theme="modern"` and also with `enable_syntax_highlight=False`

Output:

    Task: Complete `def merge_sorted_lists(a, b):`
    With syntax highlighting (4x compression):
    - Exact match: 42% | BLEU: 0.71
    Without syntax highlighting (4x compression):
    - Exact match: 35% | BLEU: 0.63
    Raw text (no compression):
    - Exact match: 45% | BLEU: 0.74

Syntax highlighting recovers most of the accuracy lost to compression.
Example 3: Quick code explanation at minimal token cost
User: "Explain this 200-line Python file but minimize API costs."
Approach:
- Render with `render_code_to_images(code, language="python", enable_syntax_highlight=True)`

Output:

    Text approach: 1,450 tokens input
    Image approach (2x): 725 tokens input
    Response: "This module implements a rate limiter using the token bucket
    algorithm. Class `RateLimiter` manages per-client buckets with
    configurable refill rates. Key functions: `acquire()` blocks until
    a token is available, `try_acquire()` returns immediately..."
- Use `auto_optimize=True` when rendering -- it adjusts font size and layout to fit the code naturally rather than clipping or leaving empty space.
- If Pygments is missing, run `pip install Pygments`. Without it, you lose the syntax highlighting benefit.
- If tiktoken is missing, run `pip install tiktoken`.

Paper: CodeOCR: On the Effectiveness of Vision Language Models in Code Understanding (Shi et al., 2026). Key sections: Section 4 for task-specific compression results, Section 5 for the syntax highlighting ablation, Table 3 for the compression-ratio-vs-accuracy tradeoffs across tasks.
Code: github.com/YerbaPage/CodeOCR.