skills/diffusion-pretrained-dense-contextual-embeddings/SKILL.md
Build production retrieval systems using pplx-embed, diffusion-pretrained dense and contextualized embedding models with INT8 quantization, late chunking for long documents, and multi-stage contrastive training. Use when: 'build a semantic search pipeline', 'set up document retrieval with contextual embeddings', 'implement late chunking for long documents', 'create a multilingual search index', 'optimize embedding storage with quantization', 'add contextualized passage retrieval to RAG'.
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
npx skillsauth add ndpvt-web/arxiv-claude-skills diffusion-pretrained-dense-contextual-embeddings
This skill enables Claude to design and implement high-quality retrieval systems using the pplx-embed architecture: a family of embedding models built on diffusion-pretrained bidirectional transformers with multi-stage contrastive learning, INT8 quantization-aware training, and a late chunking strategy that preserves global document context across passage boundaries. The core insight is that diffusion-based pretraining (masking tokens via an absorbing-state process and reconstructing them with bidirectional attention) produces a stronger backbone for embeddings than causal language models, and that a four-stage contrastive pipeline (pair, contextual, triplet, SLERP merge) yields state-of-the-art dense retrieval with efficient INT8 or binary representations.
Diffusion pretraining as an embedding backbone. Standard embedding models start from causal (left-to-right) language models and bolt on bidirectional attention at fine-tuning time. pplx-embed instead converts a Qwen3 base model into a true bidirectional encoder via continued pretraining with a diffusion objective: at each step, tokens are independently masked with probability t (continuous time, t in [0,1]) and the model learns to reconstruct them using full bidirectional self-attention. This is trained for 60K steps on ~250B multilingual tokens with sequence length 4,096. The result is a backbone that natively captures bidirectional context, which yields ~1% average improvement on retrieval benchmarks compared to causal-only pretraining.
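The corruption step described above can be sketched in a few lines of NumPy. This is an illustration only, not the authors' training code: `MASK_ID`, the function name `absorbing_mask`, and the uniform sampling of `t` per example are assumptions made for the sketch.

```python
import numpy as np

MASK_ID = 0  # hypothetical mask-token id; the real id depends on the tokenizer


def absorbing_mask(token_ids: np.ndarray, t: float, rng: np.random.Generator):
    """Mask each token independently with probability t (absorbing-state corruption).

    Returns the corrupted sequence and a boolean array marking the masked
    positions, which are the positions a reconstruction loss would cover.
    """
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    is_masked = rng.random(token_ids.shape) < t
    corrupted = np.where(is_masked, MASK_ID, token_ids)
    return corrupted, is_masked


rng = np.random.default_rng(0)
tokens = rng.integers(1, 1000, size=4096)  # stand-in for a 4,096-token sequence
t = rng.uniform(0.0, 1.0)                  # assumed: one noise level per example
corrupted, is_masked = absorbing_mask(tokens, t, rng)
```

Because every position can attend to every other position, the encoder sees unmasked context on both sides of each masked token, which is the property the embedding stages later rely on.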
Multi-stage contrastive learning. Rather than a single contrastive fine-tuning pass, pplx-embed uses four stages: (1) Pair training with InfoNCE loss, in-batch negatives, and false-negative masking; (2) Contextual training that adds a dual-objective loss combining local chunk-level and global document-level contrastive signals (this is what powers the context-v1 variant); (3) Triplet training with hard negatives for final ranking quality; and (4) SLERP model merging that spherically interpolates the contextual and triplet checkpoints to combine their strengths. The final embeddings are mean-pooled and quantized to INT8 via a tanh-based formula with straight-through gradient estimation during training.
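Two building blocks of this pipeline, the in-batch InfoNCE loss and SLERP merging, can be sketched as follows. The sketch makes several assumptions: the temperature of 0.05 is a placeholder, false-negative masking is omitted for brevity, and `slerp` operates on a single flattened weight vector because the text does not say whether merging is done per tensor or over the whole model.

```python
import numpy as np


def info_nce(queries: np.ndarray, docs: np.ndarray, temperature: float = 0.05) -> float:
    """InfoNCE with in-batch negatives: row i of `docs` is the positive for query i."""
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    d = docs / np.linalg.norm(docs, axis=1, keepdims=True)
    logits = (q @ d.T) / temperature               # (batch, batch) cosine similarities
    logits -= logits.max(axis=1, keepdims=True)    # subtract row max for stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))     # positives sit on the diagonal


def slerp(w_a: np.ndarray, w_b: np.ndarray, alpha: float) -> np.ndarray:
    """Spherical interpolation between two flattened weight vectors."""
    cos_omega = np.dot(w_a, w_b) / (np.linalg.norm(w_a) * np.linalg.norm(w_b))
    omega = np.arccos(np.clip(cos_omega, -1.0, 1.0))
    sin_omega = np.sin(omega)
    if sin_omega < 1e-6:  # nearly (anti)parallel: fall back to linear interpolation
        return (1.0 - alpha) * w_a + alpha * w_b
    return (np.sin((1.0 - alpha) * omega) * w_a + np.sin(alpha * omega) * w_b) / sin_omega
```

With `alpha=0` the merge returns the first checkpoint and with `alpha=1` the second; intermediate values move along the arc between them rather than along the straight line that plain weight averaging would follow.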
Late chunking for long documents. Instead of independently embedding each chunk, the model encodes the full document (up to 16 chunks of 256 tokens = 4,096 tokens) with bidirectional attention, then pools each chunk's token representations separately. This means each chunk embedding is informed by the entire document's context, solving the classic problem of context loss at chunk boundaries. The context-v1 variant adds a global document-level embedding objective on top of this, setting records on the ConTEB benchmark.
Assess retrieval requirements. Determine: corpus size (thousands vs. millions of documents), average document length, number of languages, whether passages need document-level context, latency budget, and storage constraints. This dictates model size (0.6B vs. 4B) and quantization (float32 vs. INT8 vs. binary).
Choose the model variant. Use pplx-embed-v1 (standard dense retrieval) when passages are self-contained or short. Use pplx-embed-context-v1 when passages come from long documents and retrieval quality depends on surrounding context (legal discovery, technical documentation, book search). The context variant adds ~5% on contextual benchmarks but requires encoding full documents rather than isolated passages.
Implement the chunking strategy. Split documents into 256-token chunks (the model's native chunk size). For the context variant, group chunks into batches of up to 16 per document (4,096 tokens). If documents exceed 4,096 tokens, use a sliding window of 16 chunks with overlap. Maintain a mapping from chunk IDs back to source documents and byte offsets.
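A minimal sketch of the chunking and windowing logic in this step, operating on token IDs that have already been produced by a tokenizer. The function names and the default overlap of 4 chunks are hypothetical; the text above only specifies the 256-token chunk size and the 16-chunk window.

```python
CHUNK_SIZE = 256   # tokens per chunk, as stated in the step above
MAX_CHUNKS = 16    # chunks per encoder pass (4,096 tokens)


def split_into_chunks(token_ids: list[int]) -> list[list[int]]:
    """Cut a token sequence into consecutive 256-token chunks (the last may be shorter)."""
    return [token_ids[i:i + CHUNK_SIZE] for i in range(0, len(token_ids), CHUNK_SIZE)]


def chunk_windows(num_chunks: int, overlap: int = 4) -> list[tuple[int, int]]:
    """Return [start, end) chunk-index windows of at most MAX_CHUNKS chunks.

    `overlap` (an assumed default) is the number of chunks shared by
    consecutive windows.
    """
    if not 0 <= overlap < MAX_CHUNKS:
        raise ValueError("overlap must be in [0, MAX_CHUNKS)")
    if num_chunks <= 0:
        return []
    if num_chunks <= MAX_CHUNKS:
        return [(0, num_chunks)]
    stride = MAX_CHUNKS - overlap
    windows = []
    start = 0
    while True:
        end = min(start + MAX_CHUNKS, num_chunks)
        windows.append((start, end))
        if end == num_chunks:
            break
        start += stride
    return windows
```

For a 40-chunk document with the assumed overlap, `chunk_windows(40)` yields three windows covering chunks 0-15, 12-27, and 24-39. Chunks that fall in two windows get two candidate embeddings, so the pipeline needs a rule for which one to keep; keeping the embedding from the window where the chunk sits farthest from an edge is one reasonable choice.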
Encode passages with late chunking. Feed the full multi-chunk sequence through the encoder with bidirectional attention. Extract per-chunk embeddings by mean-pooling the token representations within each chunk's span. Apply the INT8 quantization formula: floor(127 * tanh(mean_pool) + 0.5) to produce integer embeddings in [-127, 127]. This is natively supported if using the model's built-in pooling; replicate it if building a custom pipeline.
Encode queries. Queries are typically short (under 256 tokens) and do not need late chunking. Encode each query as a single sequence and mean-pool to get the query embedding. Apply the same INT8 quantization. No instruction prefix is needed -- pplx-embed is instruction-free.
Build the retrieval index. Store the quantized embeddings in a vector index. With FAISS, use IndexBinaryFlat (exact) or IndexBinaryIVF (approximate) for binary codes, and IndexScalarQuantizer or IndexIVFScalarQuantizer with 8-bit codes for INT8. For the 4B model, embeddings are 2,560-dimensional; for 0.6B, 1,024-dimensional. INT8 reduces storage by 4x vs. float32; binary (one bit per dimension) by 32x, at a 1-1.6% quality cost for 4B.
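The storage trade-off is simple arithmetic, sketched below for a hypothetical corpus of 10M chunk embeddings from the 4B model. The `binarize` helper assumes binary codes are obtained by thresholding each dimension at zero, which is a common convention but is not confirmed by the text above.

```python
import numpy as np


def index_size_gb(num_vectors: int, dim: int, bits_per_dim: int) -> float:
    """Raw embedding storage in gigabytes (10^9 bytes), excluding index overhead."""
    return num_vectors * dim * bits_per_dim / 8 / 1e9


def binarize(embeddings: np.ndarray) -> np.ndarray:
    """Keep the sign of each dimension and pack 8 dimensions into one byte."""
    bits = (embeddings > 0).astype(np.uint8)
    return np.packbits(bits, axis=1)


num_vectors, dim = 10_000_000, 2560  # assumed corpus size; dim of the 4B model
for name, bits in [("float32", 32), ("int8", 8), ("binary", 1)]:
    print(f"{name:8s} {index_size_gb(num_vectors, dim, bits):7.1f} GB")
```

Under these assumptions the raw vectors take about 102.4 GB in float32, 25.6 GB in INT8, and 3.2 GB in binary. The packed `uint8` rows produced by `binarize` have `dim / 8` bytes each, which is the layout FAISS binary indexes expect.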
Implement retrieval and reranking. At query time, compute the query embedding, retrieve top-k candidates via approximate nearest neighbor search (dot product on INT8 vectors), then optionally rerank with full float32 embeddings or a cross-encoder for the top 50-100 results.
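The two-stage pattern in this step can be sketched with plain NumPy. Note the substitution: an exhaustive INT8 dot-product scan stands in for the approximate nearest neighbor index, and the second stage uses float32 cosine similarity rather than a cross-encoder. The function name and the synthetic data are illustrative assumptions.

```python
import numpy as np


def retrieve_then_rerank(query_i8, query_f32, corpus_i8, corpus_f32, k=100, final_k=10):
    """Stage 1: INT8 dot-product shortlist. Stage 2: float32 cosine rerank."""
    # Cast to int32 before multiplying so int8 products do not overflow.
    coarse = corpus_i8.astype(np.int32) @ query_i8.astype(np.int32)
    k = min(k, coarse.shape[0])
    shortlist = np.argpartition(-coarse, k - 1)[:k]  # unordered ids of the top k

    cand = corpus_f32[shortlist]
    cand = cand / np.linalg.norm(cand, axis=1, keepdims=True)
    q = query_f32 / np.linalg.norm(query_f32)
    fine = cand @ q
    order = np.argsort(-fine)[:final_k]
    return shortlist[order], fine[order]


# Synthetic demo: the query is a slightly perturbed copy of document 17.
rng = np.random.default_rng(0)
corpus_f32 = rng.normal(size=(1000, 64)).astype(np.float32)
corpus_i8 = np.floor(127.0 * np.tanh(corpus_f32) + 0.5).astype(np.int8)
query_f32 = corpus_f32[17] + (0.01 * rng.normal(size=64)).astype(np.float32)
query_i8 = np.floor(127.0 * np.tanh(query_f32) + 0.5).astype(np.int8)
ids, scores = retrieve_then_rerank(query_i8, query_f32, corpus_i8, corpus_f32)
```

The shortlist size `k` controls the trade-off: a larger shortlist recovers more of the candidates the coarse INT8 scores misrank, at the cost of more float32 work per query.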
Integrate with RAG or downstream systems. Pass retrieved passages (with their document-level context metadata) to the generation model. For context-v1 embeddings, the retrieved chunks already encode surrounding context, reducing the need to fetch adjacent chunks as extra context.
Evaluate on standard benchmarks. Measure nDCG@10 on MTEB Multilingual v2 (target: ~69.7% for 4B INT8), Recall@1000 on large corpora, and ConTEB for contextual retrieval. Compare against baselines like Qwen3-Embedding and voyage-context-3.
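For reference, nDCG@10 for a single query can be computed as in the sketch below. It uses the linear-gain form (relevance divided by the log discount); some evaluation toolkits use an exponential gain instead, and the two agree when relevance labels are binary. The function names are illustrative and not taken from any benchmark harness.

```python
import numpy as np


def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the first k results (linear gain)."""
    rel = np.asarray(relevances, dtype=float)[:k]
    discounts = np.log2(np.arange(2, rel.size + 2))  # log2(rank + 1) for rank 1..k
    return float(np.sum(rel / discounts))


def ndcg_at_k(ranked_relevances, all_relevances, k=10):
    """nDCG@k: DCG of the system ranking divided by DCG of the ideal ranking.

    `ranked_relevances` are the labels of the returned results in rank order;
    `all_relevances` are the labels of every judged document for the query.
    """
    ideal = dcg_at_k(sorted(all_relevances, reverse=True), k)
    return dcg_at_k(ranked_relevances, k) / ideal if ideal > 0 else 0.0
```

A benchmark score is the mean of this value over all queries, so a single relevant document returned at rank 2 instead of rank 1 contributes about 0.63 rather than 1.0 for that query.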
Monitor and tune in production. Track query latency, recall@k at different k values, and embedding freshness. Retune the FAISS index parameters (nprobe, nlist) as the corpus grows. Consider binary quantization if storage becomes the bottleneck on corpora exceeding 10M documents.
Example 1: Building a multilingual documentation search system
User: "I need to build a search system over our product documentation in 12 languages. Documents are 2-10 pages each. We have about 500K documents."
Approach:
import numpy as np
import torch
from transformers import AutoModel, AutoTokenizer

# Load model (pseudocode -- adjust for actual model release)
tokenizer = AutoTokenizer.from_pretrained("perplexity/pplx-embed-context-v1-4b")
model = AutoModel.from_pretrained("perplexity/pplx-embed-context-v1-4b")
model.eval()  # inference only: disable dropout

CHUNK_SIZE = 256
MAX_CHUNKS = 16

def quantize_int8(vec: torch.Tensor) -> np.ndarray:
    """INT8 quantization: floor(127 * tanh(v) + 0.5), values in [-127, 127]."""
    v = vec.detach().float().cpu().numpy()
    return np.floor(127.0 * np.tanh(v) + 0.5).astype(np.int8)

def late_chunk_encode(document_text: str) -> list[np.ndarray]:
    """Encode a document with late chunking, returning per-chunk INT8 embeddings."""
    tokens = tokenizer(document_text, return_tensors="pt",
                       max_length=CHUNK_SIZE * MAX_CHUNKS, truncation=True)
    with torch.no_grad():  # no gradients needed at inference time
        outputs = model(**tokens)
    hidden = outputs.last_hidden_state[0]  # (seq_len, dim)
    chunk_embeddings = []
    seq_len = hidden.shape[0]
    for start in range(0, seq_len, CHUNK_SIZE):
        end = min(start + CHUNK_SIZE, seq_len)
        chunk_mean = hidden[start:end].mean(dim=0)  # mean pooling per chunk
        chunk_embeddings.append(quantize_int8(chunk_mean))
    return chunk_embeddings

def encode_query(query_text: str) -> np.ndarray:
    """Encode a query as a single INT8 embedding."""
    tokens = tokenizer(query_text, return_tensors="pt",
                       max_length=CHUNK_SIZE, truncation=True)
    with torch.no_grad():
        outputs = model(**tokens)
    query_mean = outputs.last_hidden_state[0].mean(dim=0)
    return quantize_int8(query_mean)
Example 2: Tool/API retrieval from a large function registry
User: "We have 15,000 internal API endpoints documented as JSON specs. Users describe what they want in natural language and we need to find the right API."
Approach:
# Index construction
# Assumes `api_registry` (a list of dicts with 'name', 'description', and
# 'params_summary' keys) and `encode_passage` (one embedding vector per text)
# are defined elsewhere; `encode_query` is the function from Example 1.
import faiss
import numpy as np

dim = 2560  # embedding size of the 4B model
index = faiss.IndexFlatIP(dim)  # inner product equals cosine on normalized vectors

api_embeddings = []
for api_spec in api_registry:
    text = f"{api_spec['name']}: {api_spec['description']}. Params: {api_spec['params_summary']}"
    emb = encode_passage(text)
    api_embeddings.append(emb)

embeddings_matrix = np.stack(api_embeddings).astype(np.float32)
faiss.normalize_L2(embeddings_matrix)  # in-place L2 normalization
index.add(embeddings_matrix)

# Query
query = "Find an API that lets me resize an image and convert it to WebP format"
query_emb = encode_query(query).reshape(1, -1).astype(np.float32)
faiss.normalize_L2(query_emb)
scores, indices = index.search(query_emb, k=5)
# indices[0] holds the row positions of the top-5 matches in api_registry order
Example 3: RAG with contextual passage retrieval
User: "Our RAG pipeline retrieves passages from 50-page contracts but often returns chunks that are meaningless without surrounding context. How do I fix this?"
Approach:
Before (standard chunking):
Chunk: "The party shall indemnify for losses described in Section 4.2."
Problem: "Section 4.2" is meaningless without the rest of the document.
After (late chunking with pplx-embed-context-v1):
Same chunk's embedding now encodes that Section 4.2 covers "intellectual property infringement"
because the bidirectional attention saw the full document during encoding.
Retrieval for "IP liability" now correctly surfaces this chunk.
Paper: Diffusion-Pretrained Dense and Contextual Embeddings (Eslami et al., 2026). Look for: Section 2 on the diffusion pretraining objective, Section 3 on the four-stage contrastive pipeline, Section 4 on late chunking and contextual embeddings, and Tables 1-5 for benchmark comparisons against Qwen3-Embedding, voyage-3, and NV-Embed.