skills/compact-hypercube-embeddings-fast/SKILL.md
Build fast similarity-search systems using compact binary hypercube embeddings derived from foundation model encoders. Replaces brute-force cosine similarity over float vectors with Hamming distance over binary codes for orders-of-magnitude speedup and memory reduction. Trigger phrases: 'binary hashing for retrieval', 'fast embedding search', 'compact embeddings for similarity', 'Hamming space retrieval', 'hash-based vector search', 'reduce embedding memory footprint'
npx skillsauth add ndpvt-web/arxiv-claude-skills compact-hypercube-embeddings-fast
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
This skill teaches Claude to design and implement retrieval systems that use compact binary hypercube embeddings instead of dense floating-point vectors. The core idea, from the Cross-View Alignment Hashing (CVA-Hash) framework, is to project pretrained encoder outputs through a learned hashing layer, binarize them with a sign function, and search in Hamming space instead of Euclidean/cosine space. This yields 32-256x memory reduction and enables sub-millisecond search over millions of items using bitwise XOR + popcount operations.
Cross-View Alignment Hashing (CVA-Hash) takes two pretrained encoders (e.g., a text encoder and an image encoder from CLIP) and attaches a lightweight hashing head to each. Each head is a small MLP that projects the encoder's continuous embedding (e.g., 512-d float32) down to a target bit length (e.g., 64 bits), then applies a sign() function to produce binary codes in {-1, +1}^K. At inference, these are stored as packed bit arrays. Retrieval becomes a Hamming distance computation: XOR two bit strings and count the 1s. On modern CPUs this costs one XOR and one popcount instruction per 64-bit word, making brute-force search over millions of codes feasible in milliseconds.
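For codes in {-1, +1}^K, Hamming distance and inner product are tied by d_H(a, b) = (K - <a, b>) / 2, which is why a loss defined on code similarities also shapes Hamming-space neighbourhoods. The following is a minimal NumPy check of that identity; the random codes are placeholders standing in for real hashing-head outputs.

```python
import numpy as np

rng = np.random.default_rng(0)
K = 64  # code length in bits

# Two random codes in {-1, +1}^K (placeholders for sign() outputs of the hashing heads).
a = rng.choice([-1, 1], size=K)
b = rng.choice([-1, 1], size=K)

# Hamming distance on the {0, 1} view: XOR marks differing bits, the sum counts them.
a_bits = (a > 0).astype(np.uint8)
b_bits = (b > 0).astype(np.uint8)
hamming = int(np.bitwise_xor(a_bits, b_bits).sum())

# Same quantity recovered from the inner product of the {-1, +1} codes.
hamming_from_dot = int((K - a @ b) // 2)

assert hamming == hamming_from_dot
```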
Training uses three losses jointly: (1) A contrastive alignment loss that pulls matching text-image (or text-audio) hash codes together and pushes non-matching codes apart in Hamming space, mirroring the InfoNCE objective used in CLIP but operating on binary codes. (2) A quantization loss that penalizes the gap between the continuous pre-sign activations and their binarized outputs, encouraging the network to produce activations near +1 or -1 so binarization loses minimal information. (3) Optionally, a bit-balance regularization that encourages each bit to be +1 or -1 with roughly equal probability across the dataset, maximizing the information entropy of the code. During training, the sign function's zero gradient is bypassed using the straight-through estimator (STE): gradients flow through sign() as if it were the identity function.
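The three terms can be sketched in PyTorch roughly as follows. This is an illustrative sketch under stated assumptions, not the paper's reference implementation: the function name `cva_hash_losses`, the symmetric cross-entropy form of the alignment loss, and the temperature of 0.07 are assumptions, while the alpha and beta defaults are the starting points suggested in this document.

```python
import torch
import torch.nn.functional as F

class SignSTE(torch.autograd.Function):
    """sign() in the forward pass, identity gradient in the backward pass."""
    @staticmethod
    def forward(ctx, x):
        return x.sign()
    @staticmethod
    def backward(ctx, grad_output):
        return grad_output

def cva_hash_losses(h_text, h_obs, temperature=0.07, alpha=0.1, beta=0.01):
    """h_text, h_obs: (B, K) pre-sign activations for B matching pairs."""
    batch_size, num_bits = h_text.shape
    c_text = SignSTE.apply(h_text)  # binary codes with straight-through gradients
    c_obs = SignSTE.apply(h_obs)

    # (1) Alignment: InfoNCE-style loss on code similarities scaled to [-1, 1];
    #     matching pairs sit on the diagonal, other in-batch items act as negatives.
    logits = (c_text @ c_obs.t()) / num_bits / temperature
    targets = torch.arange(batch_size, device=h_text.device)
    l_align = 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

    # (2) Quantization: pull |h| toward 1 so binarization loses little information.
    l_quant = 0.5 * ((h_text.abs() - 1).abs().mean() + (h_obs.abs() - 1).abs().mean())

    # (3) Bit balance: each bit's mean over the batch should stay near 0.
    l_balance = 0.5 * (c_text.mean(dim=0).abs().mean() + c_obs.mean(dim=0).abs().mean())

    return l_align + alpha * l_quant + beta * l_balance
```

Dividing the similarities by the number of bits keeps the logits in a range comparable to cosine similarity, so a CLIP-like temperature is a reasonable first guess.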
Parameter-efficient fine-tuning (PEFT) keeps the pretrained encoders mostly frozen. The paper applies LoRA adapters (rank 4-16) to the encoder's attention layers and trains only the LoRA weights plus the hashing head. This means the entire adaptation can be done on a single GPU in hours, not days. The resulting system inherits the zero-shot generalization of the foundation model while gaining efficient binary retrieval. Crucially, the hashing objective was found to improve the underlying encoder representations, yielding better retrieval even when evaluated with continuous embeddings.
Select foundation encoders for each modality. For text-image retrieval, use CLIP or BioCLIP. For text-audio, use BioLingual or CLAP. Load them with their pretrained weights and freeze all parameters initially. Identify the embedding dimension (e.g., 512 or 768).
Design the hashing head. Create a small MLP per modality: Linear(embed_dim, embed_dim) -> BatchNorm -> ReLU -> Linear(embed_dim, K) where K is the target hash bit length (32, 64, or 128). At inference, apply sign() to the K-dimensional output. During training, use the straight-through estimator.
Implement the straight-through estimator for sign(). In PyTorch:
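```python
import torch

class SignSTE(torch.autograd.Function):
    """sign() in the forward pass, identity gradient in the backward pass."""

    @staticmethod
    def forward(ctx, x):
        return x.sign()

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output  # pass gradient through unchanged
```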
Implement the training losses. Combine three terms:
- Alignment loss (L_align): the contrastive, InfoNCE-style loss over matching and non-matching text-observation hash codes described above.
- Quantization loss (L_quant): mean(abs(abs(h) - 1)), where h is the pre-sign continuous output, penalizing activations that sit away from +1 or -1 (including those near zero).
- Bit-balance loss (L_balance): mean(abs(mean(codes, dim=0))), which penalizes bits that are consistently +1 or -1 across the batch.
- Total: L = L_align + alpha * L_quant + beta * L_balance, with alpha=0.1, beta=0.01 as starting points.

Attach LoRA adapters to the encoders (optional but recommended). Use the peft library to add rank-8 LoRA to the query/value projection matrices of the encoder's transformer layers. This unfreezes ~1-2% of parameters and significantly improves hash quality over training only the hashing head.
Train on paired data. Feed text-observation pairs through their respective encoders + hashing heads. Use a batch size of 256-1024 with in-batch negatives for the contrastive loss. Train for 10-30 epochs with AdamW (lr=1e-4 for hashing heads, lr=1e-5 for LoRA weights). Use cosine annealing.
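The two learning rates map naturally onto AdamW parameter groups. A minimal sketch, assuming hypothetical stand-ins `hash_head` and `lora_params` for the real hashing heads and LoRA weights, and an epoch count picked from the 10-30 range above:

```python
import torch

# Hypothetical stand-ins: in practice these are the hashing heads and the
# trainable LoRA parameters of the wrapped encoders.
hash_head = torch.nn.Linear(512, 64)
lora_params = [torch.nn.Parameter(torch.zeros(8, 512))]

EPOCHS = 20  # assumption: any value in the suggested 10-30 range

optimizer = torch.optim.AdamW([
    {"params": hash_head.parameters(), "lr": 1e-4},  # hashing heads
    {"params": lora_params, "lr": 1e-5},             # LoRA weights
])
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=EPOCHS)

for epoch in range(EPOCHS):
    # ... iterate over paired batches here: forward pass, loss.backward() ...
    optimizer.step()   # placeholder for the per-batch updates
    scheduler.step()   # anneal both learning rates once per epoch
```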
Binarize and pack the database embeddings. After training, encode every item in the database through the observation encoder + hashing head + sign(). Convert {-1, +1} to {0, 1} and pack the bits with numpy.packbits into uint8 bytes (optionally viewed as uint64 words). A 64-bit code per item means 1 million items = 8 MB.
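A small sketch of the packing step, assuming the codes arrive as a NumPy array in {-1, +1} and that K is a multiple of 64; the helper name `pack_codes` is illustrative.

```python
import numpy as np

def pack_codes(codes):
    """codes: (N, K) array in {-1, +1}, K a multiple of 64. Returns (N, K // 64) uint64."""
    bits = (codes > 0).astype(np.uint8)   # {-1, +1} -> {0, 1}
    packed = np.packbits(bits, axis=1)    # (N, K // 8) uint8, 8 bits per byte
    # Reinterpret each group of 8 bytes as one 64-bit word (no copy of the data).
    return np.ascontiguousarray(packed).view(np.uint64)

rng = np.random.default_rng(0)
codes = rng.choice([-1, 1], size=(1000, 64))
db = pack_codes(codes)
print(db.shape, db.nbytes)  # (1000, 1) 8000
```

At 8 bytes per item, the same layout scales to 8 MB for 1 million items. The byte order inside each word does not affect Hamming distances as long as queries are packed the same way.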
Implement Hamming distance search. At query time, encode the text query the same way, pack its bits, and compute Hamming distance against all database codes using XOR + popcount:
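```python
import numpy as np

def hamming_search(query_bits, db_bits):
    """Return database indices sorted by Hamming distance to the query.

    query_bits: (K//8,) uint8, db_bits: (N, K//8) uint8
    """
    xor = np.bitwise_xor(db_bits, query_bits)           # differing bits, broadcast over N
    distances = np.unpackbits(xor, axis=1).sum(axis=1)  # popcount per row
    return np.argsort(distances)
```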
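When only the top few results are needed, fully sorting all N distances is wasteful. One possible variant, sketched here with a hypothetical `hamming_topk` helper, uses a 256-entry popcount lookup table and `np.argpartition` to select the k nearest codes before ordering them:

```python
import numpy as np

# Number of set bits for every possible byte value (0..255).
POPCOUNT = np.unpackbits(
    np.arange(256, dtype=np.uint8)[:, None], axis=1
).sum(axis=1).astype(np.uint8)

def hamming_topk(query_bits, db_bits, k):
    """query_bits: (K//8,) uint8, db_bits: (N, K//8) uint8. Returns the k nearest indices."""
    xor = np.bitwise_xor(db_bits, query_bits)
    distances = POPCOUNT[xor].sum(axis=1, dtype=np.int32)  # table lookup per byte
    k = min(k, len(distances))
    nearest = np.argpartition(distances, k - 1)[:k]        # k smallest, unordered
    return nearest[np.argsort(distances[nearest], kind="stable")]
```

The lookup table avoids materializing an (N, K) array of unpacked bits, which matters once N reaches the millions.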
Evaluate with mAP@K and Precision@K. Compare binary retrieval against continuous-embedding retrieval (cosine similarity on the raw encoder outputs) using mean Average Precision at K=1, 5, 10, 20. Expect binary retrieval to reach 85-100% of continuous performance at 64+ bits while being 50-100x faster.
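The metrics themselves are short to write down. The sketch below assumes each query comes with a 0/1 relevance list in ranked order, and normalizes AP@K by the number of relevant items found in the top K; other normalizations exist, so check which one the paper's tables use before comparing numbers.

```python
import numpy as np

def precision_at_k(relevant, k):
    """relevant: 0/1 sequence in ranked order (1 = retrieved item matches the query)."""
    return float(np.mean(relevant[:k]))

def average_precision_at_k(relevant, k):
    """Mean of precision@i over the ranks i <= k that hold a relevant item."""
    rel = np.asarray(relevant[:k], dtype=float)
    if rel.sum() == 0:
        return 0.0
    precisions = np.cumsum(rel) / np.arange(1, len(rel) + 1)
    return float((precisions * rel).sum() / rel.sum())

def map_at_k(ranked_relevance, k):
    """ranked_relevance: one 0/1 relevance sequence per query."""
    return float(np.mean([average_precision_at_k(r, k) for r in ranked_relevance]))
```

For a ranking with relevance `[1, 0, 1, 0, 0]`, precision@5 is 2/5 and AP@5 is (1/1 + 2/3) / 2 = 5/6. Running the same functions on the binary and the continuous rankings gives the comparison described above.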
Deploy as a two-stage pipeline (optional). For maximum quality, use binary search to retrieve the top-100 candidates, then re-rank them with continuous embeddings. This gives near-exact retrieval quality with near-binary search speed.
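A minimal NumPy sketch of the two-stage idea, assuming the continuous embeddings are kept alongside the packed codes; the function name `two_stage_search` and its parameters are illustrative.

```python
import numpy as np

def two_stage_search(query_bits, query_emb, db_bits, db_embs, n_candidates=100, k=10):
    """Shortlist by Hamming distance, then re-rank the shortlist by cosine similarity.

    query_bits: (K//8,) uint8      db_bits: (N, K//8) uint8
    query_emb:  (D,) float         db_embs: (N, D) float
    """
    # Stage 1: cheap Hamming shortlist over the whole database.
    xor = np.bitwise_xor(db_bits, query_bits)
    hamming = np.unpackbits(xor, axis=1).sum(axis=1)
    n_candidates = min(n_candidates, len(hamming))
    shortlist = np.argpartition(hamming, n_candidates - 1)[:n_candidates]

    # Stage 2: exact cosine similarity, computed only for the shortlisted items.
    cand = db_embs[shortlist]
    cand = cand / np.linalg.norm(cand, axis=1, keepdims=True)
    q = query_emb / np.linalg.norm(query_emb)
    order = np.argsort(-(cand @ q))[:k]
    return shortlist[order]
```

Only the shortlisted float vectors are touched in stage 2, so they can live on disk or in a memory-mapped file while the packed codes stay in RAM.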
Example 1: Building a wildlife image search engine
User: "I have 500K iNaturalist images with CLIP embeddings stored as float32 vectors. Search is too slow and uses 1GB of RAM. Help me compress these for faster retrieval."
Approach:
Output:
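```python
import numpy as np
import torch
from transformers import CLIPModel, CLIPProcessor

class SignSTE(torch.autograd.Function):
    """sign() forward, identity gradient backward (straight-through estimator)."""
    @staticmethod
    def forward(ctx, x):
        return x.sign()

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output

# 1. Load CLIP and attach hashing heads
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
HASH_BITS = 64

class HashingHead(torch.nn.Module):
    def __init__(self, input_dim, hash_bits):
        super().__init__()
        self.fc = torch.nn.Sequential(
            torch.nn.Linear(input_dim, input_dim),
            torch.nn.BatchNorm1d(input_dim),
            torch.nn.ReLU(),
            torch.nn.Linear(input_dim, hash_bits),
        )

    def forward(self, x):
        h = self.fc(x)
        if self.training:
            return h, SignSTE.apply(h)  # continuous activations + STE-binarized codes
        return h.sign()

text_hash = HashingHead(512, HASH_BITS)
image_hash = HashingHead(512, HASH_BITS)
# Train the heads as described above, then switch to inference mode so that
# BatchNorm uses its running statistics and forward() returns binary codes only.
text_hash.eval()
image_hash.eval()

def to_packed_bits(codes):
    """(B, K) codes in {-1, +1} -> (B, K // 8) uint8."""
    return np.packbits((codes > 0).cpu().numpy().astype(np.uint8), axis=1)

# 2. Encode entire database → binary
# `dataloader` is assumed to yield batches of preprocessed images (pixel_values tensors).
db_codes = []
with torch.no_grad():
    for batch in dataloader:
        emb = clip.get_image_features(pixel_values=batch)
        codes = image_hash(emb)                 # (B, 64) in {-1, +1}
        db_codes.append(to_packed_bits(codes))  # (B, 8) uint8
db_codes = np.concatenate(db_codes)  # (500000, 8) = 4 MB

# 3. Query
with torch.no_grad():
    inputs = processor(text=["red-tailed hawk flying"], return_tensors="pt", padding=True)
    query_emb = clip.get_text_features(**inputs)
    query_bits = to_packed_bits(text_hash(query_emb))  # (1, 8) uint8
distances = np.unpackbits(np.bitwise_xor(db_codes, query_bits), axis=1).sum(1)
top_k = np.argsort(distances)[:20]
```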
Example 2: Adding binary search to an existing audio monitoring pipeline
User: "I have a BioLingual model encoding bird call spectrograms. I want to search 2 million recordings by text description like 'woodpecker drumming on dead tree'. Current FAISS index is 3 GB."
Approach:
Output:
```python
# Memory comparison
continuous_storage = 2_000_000 * 768 * 4  # 6.1 GB (float32, 768-d)
binary_storage = 2_000_000 * 16  # 32 MB (128-bit codes)
compression_ratio = continuous_storage / binary_storage  # 192x

# Search speed comparison (single-threaded, approximate)
# Continuous cosine similarity: ~2000ms for 2M @ 768-d
# Binary Hamming distance: ~5ms for 2M @ 128-bit
# Speedup: ~400x
```
Example 3: Zero-shot domain transfer for soundscape monitoring
User: "I trained hash codes on iNatSounds but need to deploy on a rainforest soundscape dataset I don't have labels for. Will it generalize?"
Approach:
Paper: Compact Hypercube Embeddings for Fast Text-based Wildlife Observation Retrieval (Moummad et al., 2026). Focus on Section 3 (CVA-Hash framework), Section 4 (loss functions and training), and Tables 1-3 (mAP comparisons across bit lengths showing binary codes matching or exceeding continuous baselines).