skills/event-vstream-event-driven-real-time-understanding/SKILL.md
Build event-driven video stream processing pipelines that detect meaningful state transitions instead of processing every frame. Use when asked to: 'build a real-time video understanding system', 'detect events in a video stream', 'process long video with memory', 'reduce redundant frame processing', 'stream video to LLM efficiently', 'build an event-aware video pipeline'.
npx skillsauth add ndpvt-web/arxiv-claude-skills event-vstream-event-driven-real-time-understanding
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
This skill enables Claude to design and implement event-driven video stream processing systems based on the Event-VStream framework. Instead of processing every frame at fixed intervals (which wastes compute on redundant content and forgets past context), Event-VStream detects semantically meaningful state transitions by fusing motion, semantic, and predictive cues, then triggers language generation only at those boundaries. A persistent memory bank consolidates event embeddings for long-horizon reasoning. This approach achieves competitive accuracy while maintaining sub-100ms latency across multi-hour streams.
Event-driven processing replaces fixed-interval decoding. Traditional streaming video-LLM systems sample frames at a constant rate (e.g., every 0.5s) and decode text continuously. This produces repetitive outputs during static scenes and misses fast transitions. Event-VStream flips this: it monitors a lightweight boundary score fusing three complementary signals, and only invokes the expensive language model when a genuine state transition is detected.
Three-signal boundary detection. The boundary score E_t combines: (1) semantic drift -- cosine distance between the current frame embedding and a running event average, catching content changes; (2) motion cue -- normalized optical flow or frame-difference energy, which empirically precedes semantic drift by ~2 seconds and acts as an early warning; (3) prediction error -- the L2 error of a lightweight 3-layer MLP that predicts the next frame embedding from the previous one, catching unexpected transitions. An adaptive threshold tau_t tightens during high-motion segments (to avoid false triggers from continuous motion) and relaxes during stable scenes. Ablation shows all three signals are essential: removing motion drops win rate from 68% to 12%, removing semantics drops it to 38%, and removing prediction drops it to 47%.
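As one illustration of the motion cue, the sketch below computes frame-difference energy as the mean absolute pixel difference between two consecutive grayscale frames. The function name `frame_difference_energy` and the nested-list frame format are assumptions made for this example; the text above does not specify how the energy is computed, and optical flow is an equally valid choice.

```python
def frame_difference_energy(prev_frame, frame):
    """Mean absolute pixel difference between two equally sized grayscale frames.

    Frames are nested lists of pixel intensities (rows of values), a
    hypothetical format chosen to keep the sketch dependency-free.
    """
    total = 0.0
    count = 0
    for prev_row, row in zip(prev_frame, frame):
        for p, q in zip(prev_row, row):
            total += abs(q - p)
            count += 1
    return total / count if count else 0.0


static = frame_difference_energy([[10, 10], [10, 10]], [[10, 10], [10, 10]])
moving = frame_difference_energy([[0, 0], [0, 0]], [[255, 255], [0, 0]])
print(static, moving)
```

The raw energy still has to be normalized (for example by a running maximum) before it is fused with the other two signals.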
Persistent event memory with merge-or-append. When a boundary fires, frames within the detected segment are aggregated into an event embedding using Gaussian-weighted pooling (emphasizing frames near the boundary). This embedding is either merged into the most recent memory slot (if cosine similarity exceeds a redundancy threshold gamma_mem) or appended as a new entry. This keeps the memory bank compact -- semantically similar consecutive events consolidate rather than accumulating. At generation time, relevant past events are retrieved by similarity to the current event, giving the language model long-horizon context without growing the token budget.
Set up the frame ingestion loop. Accept video frames from a stream source (webcam, RTSP, file) at a fixed capture rate (2 FPS is the paper's default). Extract a visual embedding f_t for each frame using a pretrained vision encoder (CLIP, SigLIP, or a VideoLLM-Online encoder).
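A minimal sketch of the capture-rate step, assuming the source already yields `(timestamp, frame)` pairs in time order. The helper name `sample_frames` is hypothetical, and the frame objects are opaque here; decoding and the vision encoder are out of scope.

```python
from typing import Iterable, Iterator, Tuple, TypeVar

Frame = TypeVar("Frame")


def sample_frames(frames: Iterable[Tuple[float, Frame]],
                  target_fps: float = 2.0) -> Iterator[Tuple[float, Frame]]:
    """Yield frames spaced at least 1 / target_fps seconds apart."""
    min_gap = 1.0 / target_fps
    next_due = None
    for timestamp, frame in frames:
        # Small tolerance so floating-point timestamps on the boundary pass
        if next_due is None or timestamp >= next_due - 1e-9:
            yield timestamp, frame
            next_due = timestamp + min_gap


# Two seconds of a synthetic 30 FPS source, reduced to 2 FPS
source = [(i / 30, i) for i in range(60)]
kept = [index for _, index in sample_frames(source)]
print(kept)
```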
Maintain a running event representation. Keep an exponential moving average f_bar of frame embeddings within the current event segment: f_bar <- (1 - rho) * f_bar + rho * f_t. This serves as the "what the current event looks like" anchor.
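The update rule can be sketched in a few lines. This version uses plain Python lists to stay dependency-free (in practice the embeddings would likely be arrays), and the helper names `ema_update` and `half_life_frames` are assumptions for illustration. The value rho=0.05 matches the default used in the detector example later in this document.

```python
import math


def ema_update(f_bar, f_t, rho=0.05):
    """f_bar <- (1 - rho) * f_bar + rho * f_t, applied element-wise."""
    return [(1.0 - rho) * a + rho * b for a, b in zip(f_bar, f_t)]


def half_life_frames(rho):
    """Number of updates after which an old embedding's weight drops to 50%."""
    return math.log(0.5) / math.log(1.0 - rho)


f_bar = ema_update([1.0, 0.0], [0.0, 1.0], rho=0.5)
print(f_bar)
print(half_life_frames(0.05))
```

With rho=0.05 the half-life is between 13 and 14 frames, so at 2 FPS the anchor reflects roughly the last seven seconds of the current event.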
Compute the three-signal boundary score. For each frame, calculate:
Semantic drift: 1 - s_t, where s_t = cosine_similarity(f_t, f_bar)
Motion cue: m_tilde_t, the normalized motion energy
Prediction error: c_t = ||MLP(f_{t-1}) - f_t||^2, using a lightweight 3-layer MLP
Combined score: E_t = w_sem * (1 - s_t) + w_mot * m_tilde_t + w_pred * c_t
Apply the adaptive threshold. Compute tau_t = tau_0 * (1 + eta * Var(m_{t-w:t})) where tau_0 = 0.96 and eta = 0.03. A boundary fires when sigmoid(E_t) > tau_t. Enforce a minimum interval Delta_min between triggers to coalesce bursty updates, and a maximum interval Delta_max to prevent excessive silence.
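To make the adaptive threshold concrete, the sketch below evaluates tau_t for a calm and a bursty motion window, then inverts the sigmoid to show how large E_t must be for a boundary to fire. The helper names are hypothetical, and population variance is an assumption (it matches NumPy's default for `np.var`).

```python
import math
from statistics import pvariance


def adaptive_threshold(motion_window, tau_0=0.96, eta=0.03):
    """tau_t = tau_0 * (1 + eta * Var(motion_window)), population variance."""
    variance = pvariance(motion_window) if len(motion_window) > 1 else 0.0
    return tau_0 * (1.0 + eta * variance)


def required_score(tau_t):
    """Smallest E_t with sigmoid(E_t) > tau_t, i.e. the logit of tau_t."""
    if tau_t >= 1.0:
        return math.inf  # sigmoid never reaches 1, so no boundary can fire
    return math.log(tau_t / (1.0 - tau_t))


calm = adaptive_threshold([0.1, 0.1, 0.1, 0.1])    # variance 0
bursty = adaptive_threshold([0.0, 1.0, 0.0, 1.0])  # variance 0.25
print(calm, required_score(calm))
print(bursty, required_score(bursty))
```

At tau_0 = 0.96 the score E_t must exceed roughly 3.18 even in a perfectly calm window, so the fused signals need to live on a scale where such values are reachable. That scaling is not specified here and should be checked when choosing weights.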
Aggregate the event embedding on boundary detection. Collect all frame embeddings since the last boundary. Compute a Gaussian-weighted average: E_k = sum(w_i * f_i) / sum(w_i) where w_i ~ exp(-|t_i - t_boundary| / sigma). This emphasizes frames near the transition point.
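A minimal sketch of this pooling step, using the weight formula exactly as written above, exp(-|t_i - t_boundary| / sigma). A strict Gaussian kernel would square the time difference instead. The function name `pool_event` and the default sigma=1.0 are assumptions, since no value for sigma is given here.

```python
import math


def pool_event(embeddings, timestamps, t_boundary, sigma=1.0):
    """Weighted average of frame embeddings, favoring frames near the boundary."""
    if not embeddings:
        raise ValueError("need at least one frame embedding")
    weights = [math.exp(-abs(t - t_boundary) / sigma) for t in timestamps]
    total = sum(weights)
    dim = len(embeddings[0])
    return [
        sum(w * e[d] for w, e in zip(weights, embeddings)) / total
        for d in range(dim)
    ]


# Two frames one second apart; the boundary coincides with the second frame
pooled = pool_event([[0.0], [1.0]], [0.0, 1.0], t_boundary=1.0)
print(pooled)
```

In this example the pooled value lands closer to the second frame's embedding because that frame sits on the boundary and receives the largest weight.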
Update the persistent memory bank. Apply the merge-or-append rule: if cosine_similarity(E_k, E_last) > gamma_mem, merge via E_last <- (1 - lambda) * E_last + lambda * E_k. Otherwise, append E_k as a new memory slot.
Retrieve relevant context and generate text. Query the memory bank for the top-K events most similar to E_k. Concatenate these retrieved event embeddings with E_k as visual context tokens, prepend a system prompt, and invoke the language model to produce a description or answer: y_k = LM(E_k, Retrieve(Memory, E_k)).
Between boundaries, remain silent. Do not invoke the language model. Continue the frame ingestion and scoring loop. This is the key efficiency gain -- the LM is called only when something meaningful happens.
Reset the running average. After a boundary fires, reset f_bar to the current frame embedding to begin tracking the next event segment.
Implement pacing controls for production use. Add a minimum silence interval (prevents chattering during rapid scene changes) and a maximum silence interval (forces a periodic update even during very long static scenes for liveness).
Example 1: Building a security camera event narrator
User: "I want a system that watches a security camera feed and only generates descriptions when something actually happens -- not every frame."
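Approach: configure the boundary detector with the adaptive-threshold defaults (tau_0=0.96, eta=0.03) and the memory bank with the redundancy threshold (gamma_mem=0.85).
Output: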
    import numpy as np
    from collections import deque


    class EventBoundaryDetector:
        """Fuses semantic drift, motion, and prediction error into a boundary decision."""

        def __init__(self, tau_0=0.96, eta=0.03, window=30, rho=0.05,
                     w_sem=0.4, w_mot=0.35, w_pred=0.25,
                     delta_min=2.0, delta_max=30.0):
            self.tau_0 = tau_0
            self.eta = eta
            self.w_sem = w_sem
            self.w_mot = w_mot
            self.w_pred = w_pred
            self.rho = rho
            self.delta_min = delta_min
            self.delta_max = delta_max
            self.f_bar = None  # Running event average
            self.motion_history = deque(maxlen=window)
            self.last_boundary_time = -float('inf')
            self.predictor = None  # Lightweight MLP, set externally

        def cosine_sim(self, a, b):
            return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)

        def update(self, f_t, motion_energy, timestamp):
            """Return True if an event boundary fires at this frame."""
            if self.f_bar is None:
                # First frame: start the first event segment
                self.f_bar = f_t.copy()
                self.last_boundary_time = timestamp
                return False

            # Signal 1: Semantic drift
            sem_drift = 1.0 - self.cosine_sim(f_t, self.f_bar)

            # Signal 2: Motion, normalized by the maximum in the window
            self.motion_history.append(motion_energy)
            m_norm = motion_energy / (max(self.motion_history) + 1e-8)

            # Signal 3: Prediction error
            pred_error = 0.0
            if self.predictor is not None:
                pred_error = self.predictor.prediction_error(f_t)

            # Combined boundary score
            E_t = (self.w_sem * sem_drift +
                   self.w_mot * m_norm +
                   self.w_pred * pred_error)
            p_t = 1.0 / (1.0 + np.exp(-E_t))  # sigmoid
            # Note: with no predictor attached, E_t <= 2 * w_sem + w_mot = 1.15
            # at the default weights, so p_t stays below about 0.76 and cannot
            # exceed tau_0 = 0.96; boundaries then fire only through delta_max.

            # Adaptive threshold
            motion_var = np.var(list(self.motion_history)) if len(self.motion_history) > 1 else 0.0
            tau_t = self.tau_0 * (1.0 + self.eta * motion_var)

            # Pacing: enforce min/max intervals
            elapsed = timestamp - self.last_boundary_time
            if elapsed < self.delta_min:
                fire = False
            elif elapsed > self.delta_max:
                fire = True  # Force periodic update
            else:
                fire = p_t > tau_t

            if fire:
                # Reset the running average to start the next event segment
                self.f_bar = f_t.copy()
                self.last_boundary_time = timestamp
            else:
                # Update running average (EMA)
                self.f_bar = (1 - self.rho) * self.f_bar + self.rho * f_t
            return fire
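The detector leaves `predictor` unset and only requires an object with a `prediction_error(f_t)` method. As a placeholder while the 3-layer MLP is not yet trained, a persistence baseline can fill that slot. The class below is a hypothetical stand-in, not the MLP described earlier: it predicts that the next embedding equals the previous one and reports the squared L2 distance.

```python
class LastFramePredictor:
    """Baseline stand-in: predicts the next embedding to equal the previous one."""

    def __init__(self):
        self.prev = None  # Embedding seen on the previous call

    def prediction_error(self, f_t):
        """Squared L2 distance between the previous and current embeddings."""
        if self.prev is None:
            error = 0.0  # Nothing to compare against on the first call
        else:
            error = sum((a - b) ** 2 for a, b in zip(self.prev, f_t))
        self.prev = list(f_t)
        return float(error)


predictor = LastFramePredictor()
print(predictor.prediction_error([1.0, 0.0]))
print(predictor.prediction_error([0.0, 1.0]))
```

Assigning an instance to `detector.predictor` activates the third signal. A learned predictor would replace this class without changing the detector.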
Example 2: Long video indexing with event-based retrieval
User: "I have 4 hours of conference talk recordings. I want to index them so users can search for specific moments by natural language query."
Approach:
Output:
    import numpy as np


    def cosine_sim(a, b):
        """Cosine similarity with a small epsilon to avoid division by zero."""
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)


    class EventMemoryBank:
        def __init__(self, gamma_mem=0.85, merge_lambda=0.3):
            # Each event: {"embedding": ..., "start": ..., "end": ..., "description": ...}
            self.events = []
            self.gamma_mem = gamma_mem
            self.merge_lambda = merge_lambda

        def add_event(self, embedding, start_time, end_time):
            """Merge into the last slot if similar enough, else append. Returns the slot index."""
            if self.events:
                sim = cosine_sim(embedding, self.events[-1]["embedding"])
                if sim > self.gamma_mem:
                    # Merge: consolidate similar consecutive events
                    last = self.events[-1]
                    last["embedding"] = ((1 - self.merge_lambda) * last["embedding"]
                                         + self.merge_lambda * embedding)
                    last["end"] = end_time
                    return len(self.events) - 1
            # Append as new event
            self.events.append({
                "embedding": embedding,
                "start": start_time,
                "end": end_time,
                "description": None
            })
            return len(self.events) - 1

        def retrieve(self, query_embedding, top_k=5):
            """Return the top_k stored events most similar to the query embedding."""
            scores = [(i, cosine_sim(query_embedding, e["embedding"]))
                      for i, e in enumerate(self.events)]
            scores.sort(key=lambda x: -x[1])
            return [self.events[i] for i, _ in scores[:top_k]]
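Over a multi-hour stream the event list can keep growing whenever consecutive events are too dissimilar to merge. One way to bound it, sketched here under the assumption that the oldest events are the most expendable, is a fixed-capacity deque that evicts first-in-first-out. The class name `CappedEventStore` and the default cap of 500 are illustrative choices.

```python
from collections import deque


class CappedEventStore:
    """Holds at most max_events entries, evicting the oldest first."""

    def __init__(self, max_events=500):
        if max_events < 1:
            raise ValueError("max_events must be at least 1")
        self.events = deque(maxlen=max_events)

    def append(self, event):
        """Store an event and return the evicted one, or None if nothing was dropped."""
        full = len(self.events) == self.events.maxlen
        evicted = self.events[0] if full else None
        self.events.append(event)  # deque drops the leftmost item when full
        return evicted


store = CappedEventStore(max_events=2)
for name in ["a", "b", "c"]:
    dropped = store.append(name)
print(list(store.events), dropped)
```

Returning the evicted entry lets a caller archive it to disk instead of discarding it, which may matter for indexing use cases.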
Example 3: Adapting an existing video-LLM for streaming
User: "I have a batch video QA model. How do I convert it to handle real-time streams without running out of memory on long videos?"
Approach:
Output architecture:
    Frame Stream (2 FPS)
        |
        v
    Vision Encoder (CLIP/SigLIP) --> f_t
        |
        v
    Boundary Detector (motion + semantic + prediction)
        |
        |-- No boundary: update EMA, continue
        |
        |-- Boundary detected:
                |
                v
            Event Aggregator (Gaussian-weighted pooling)
                |
                v
            Memory Bank (merge-or-append)
                |
                v
            Retrieve top-K past events
                |
                v
            LLM Decoder (current event + retrieved context -> text)
                |
                v
            Output description / answer
Tune gamma_mem for your use case. Because events merge only when their similarity exceeds gamma_mem, higher values (0.9) merge only near-duplicates and keep events more granular, while lower values (0.7) merge similar segments aggressively. Start at 0.85 and adjust based on event density.
Implement pacing controls (Delta_min, Delta_max). Without them, rapid scene changes (e.g., channel surfing) cause decoder flooding, and static scenes (e.g., parking lot at night) produce no output for hours.
The adaptive threshold term (eta * Var(motion)) prevents false triggers during sustained motion (e.g., a person continuously walking) while staying sensitive during calm periods.
Too many boundaries: increase tau_0 or Delta_min. Check if the motion signal is noisy -- apply temporal smoothing (e.g., 3-frame moving average) before feeding it to the detector.
Too few boundaries: lower tau_0 or check that the vision encoder produces meaningfully different embeddings for different scenes. A poorly-trained encoder yields flat similarity scores.
Memory bank growing too large: check that gamma_mem is not set too high, since a high value prevents merging. Add a hard cap (e.g., 500 events) with FIFO eviction of oldest entries if needed.
The eta parameter needs tuning per camera setup.
Paper: Event-VStream: Event-Driven Real-Time Understanding for Long Video Streams (Guo et al., 2026). Focus on Section 3 (boundary detection formulas and adaptive threshold), Section 4 (memory bank merge-or-append rule), and Table 2 (ablation showing contribution of each signal).