skills/gdcnet-generative-discrepancy-comparison/SKILL.md
Detect sarcasm and semantic incongruity in multimodal (image+text) content using the GDCNet three-channel discrepancy comparison approach. Generates objective image descriptions as semantic anchors, then computes semantic, sentiment, and fidelity discrepancies against the original text to surface contradictions that signal sarcasm, irony, or misleading content.
Trigger phrases:
- "detect sarcasm in this image and text"
- "check if this social media post is sarcastic"
- "find contradictions between the image and caption"
- "build a multimodal sarcasm detector"
- "analyze image-text incongruity"
- "implement discrepancy-based sarcasm detection"
npx skillsauth add ndpvt-web/arxiv-claude-skills gdcnet-generative-discrepancy-comparison
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
This skill enables Claude to build and apply multimodal sarcasm detection systems using the GDCNet approach. Instead of naively comparing image and text embeddings (which fails when content is loosely related or semantically indirect), GDCNet generates a factual, objective description of the image using a Multimodal LLM, then computes three targeted discrepancy channels -- semantic, sentiment, and visual-textual fidelity -- between that grounded description and the original text. A gated fusion module adaptively combines these signals with raw visual and textual features for classification. This technique generalizes beyond sarcasm to any task requiring detection of cross-modal contradictions or incongruity.
The core insight: Direct embedding comparison between images and text is brittle -- an image of a sunny beach paired with "What a wonderful Monday at work" won't show obvious embedding misalignment because neither modality is inherently negative. GDCNet solves this by introducing a semantic anchor: an objective, factual description of the image generated by a Multimodal LLM (the paper uses LLaVA-NeXT). For a photo of an office, that caption might read "A person sitting at a desk in a dimly lit office cubicle", and it becomes the stable reference against which the original text is compared. The discrepancy between "dimly lit office cubicle" and "wonderful Monday" is far more detectable than raw image-text embedding distance.
Three discrepancy channels capture different aspects of incongruity. (1) Semantic discrepancy measures how much the meaning of the generated caption diverges from the original text using encoded representations (e.g., BERT embeddings). (2) Sentiment discrepancy captures emotional contradictions -- the caption might describe a neutral or negative scene while the text expresses exaggerated positivity. (3) Visual-textual fidelity quantifies how faithfully the original text describes what is actually in the image, catching cases where the text simply ignores or misrepresents visual content.
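The three channels can be sketched as small scoring functions. This is a minimal illustration rather than the paper's implementation: it assumes the embeddings, signed sentiment polarities in [-1, 1], and an NLI entailment probability have already been produced by upstream models. The function names and the clamping used for normalization are assumptions made for this example.

```python
import math


def semantic_discrepancy(caption_vec, text_vec):
    """Cosine distance between two embeddings, clamped to [0, 1] (assumed normalization)."""
    dot = sum(a * b for a, b in zip(caption_vec, text_vec))
    norm = math.sqrt(sum(a * a for a in caption_vec)) * math.sqrt(sum(b * b for b in text_vec))
    if norm == 0.0:
        return 1.0  # assumption: an all-zero embedding shares nothing with the other text
    cosine = dot / norm
    return min(1.0, max(0.0, 1.0 - cosine))


def sentiment_discrepancy(caption_polarity, text_polarity):
    """Absolute gap between signed polarities in [-1, 1], rescaled to [0, 1]."""
    return abs(caption_polarity - text_polarity) / 2.0


def fidelity_discrepancy(entailment_prob):
    """Low entailment of the text by the caption means high discrepancy."""
    return 1.0 - entailment_prob


# Illustrative inputs: a negative caption (-0.68), a positive text (+0.94),
# and a low entailment probability (0.08).
scores = {
    "semantic": semantic_discrepancy([1.0, 0.0], [0.0, 1.0]),
    "sentiment": sentiment_discrepancy(-0.68, 0.94),
    "fidelity": fidelity_discrepancy(0.08),
}
```

Keeping each channel as a separate function on a shared [0, 1] scale makes the scores directly comparable and easy to report individually.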
Gated fusion prevents any single channel from dominating. A learnable gating mechanism assigns dynamic weights to each discrepancy channel and the raw visual/textual features, allowing the model to rely more on sentiment discrepancy for emotionally charged sarcasm but shift to semantic discrepancy for factual contradictions. This is implemented as gate functions with learnable parameters that produce a unified representation fed to a classification head.
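A forward pass through such a gate can be sketched in plain Python. This is only an illustration under stated assumptions: the gate is taken to be a single sigmoid layer applied element-wise to the concatenated input, the weights are random stand-ins for learned parameters, and the feature vectors are toy values. In a real model the same computation would live in a trainable layer of a deep learning framework.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(features, weights, bias):
    """Compute g = sigmoid(W x + b) and the fused vector h = g * x (element-wise)."""
    pre_activation = [
        sum(w * x for w, x in zip(row, features)) + b
        for row, b in zip(weights, bias)
    ]
    gate = [sigmoid(z) for z in pre_activation]
    fused = [g * x for g, x in zip(gate, features)]
    return gate, fused


# Toy input: three discrepancy scores followed by 2-d visual and 2-d textual features.
d_sem, d_sent, d_fid = 0.72, 0.81, 0.92
visual, textual = [0.3, -0.1], [0.5, 0.2]
x = [d_sem, d_sent, d_fid, *visual, *textual]

# Random weights stand in for parameters that training would learn (assumption).
rng = random.Random(0)
dim = len(x)
W = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(dim)]
b = [0.0] * dim

gate, fused = gated_fusion(x, W, b)
```

Each gate value lies strictly between 0 and 1, so every input dimension is scaled down by a learned amount instead of being dropped outright.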
Ingest the image-text pair. Load the image and its associated text (social media caption, meme text, article headline, etc.) as separate inputs. Validate that both modalities are present; fall back to text-only analysis if no image exists.
Generate an objective image description. Pass the image through a Multimodal LLM (LLaVA-NeXT, GPT-4V, or Claude's own vision) with a prompt like: "Describe the contents of this image in factual, objective terms. Do not interpret emotions, intent, or sarcasm. Focus on what is visually present: objects, people, actions, setting, colors, text overlays." This caption serves as the semantic anchor.
Encode all three text signals. Using a text encoder (BERT, RoBERTa, or sentence-transformers), produce embeddings for: (a) the original text, (b) the generated image description, and (c) optionally, any text extracted from the image via OCR. Store these as vectors for downstream comparison.
Compute semantic discrepancy. Calculate the cosine distance (or other divergence metric) between the embedding of the generated image description and the original text. High distance indicates the text says something semantically different from what the image depicts. Normalize this score to [0, 1].
Compute sentiment discrepancy. Run sentiment analysis on both the generated description and the original text (using a sentiment classifier or a simple positive/neutral/negative scorer). Compute the absolute difference in sentiment polarity scores. A positive description paired with negative text (or vice versa) yields a high sentiment discrepancy.
Compute visual-textual fidelity. Measure how much the original text actually describes the image content. Use the CLIP similarity score between the image and the original text, or compute entailment probability between the generated caption and the original text using an NLI model. Low fidelity means the text ignores or contradicts what's in the image.
Extract raw modality features. Encode the image through a vision encoder (CLIP ViT, ResNet) and the text through the text encoder. These raw representations capture information beyond what the discrepancy channels measure.
Apply gated fusion. Concatenate the three discrepancy scores with the raw visual and textual feature vectors. Pass through a gating network: g = sigmoid(W * [d_sem, d_sent, d_fid, v, t] + b), then compute the fused representation as h = g * [d_sem, d_sent, d_fid, v, t]. This lets the model learn which signals matter for each input.
Classify. Feed the fused representation through a classification head (linear layer + softmax) to produce sarcasm probability. Apply a threshold (default 0.5) for binary classification.
Return structured results. Output the classification label, confidence score, and the individual discrepancy scores so the user can interpret why the system flagged something as sarcastic (e.g., "High sentiment discrepancy: image shows a traffic jam but text says 'Love my commute!'").
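The last two steps (classify, then return structured results) might look like the following sketch. It assumes the classification head emits two logits ordered as [not sarcastic, sarcastic]. The `SarcasmResult` type, the `build_result` helper, and the `flag_above` cutoff for explanations are hypothetical names introduced for this example, not part of GDCNet.

```python
import math
from dataclasses import dataclass, field


@dataclass
class SarcasmResult:
    label: str
    confidence: float
    discrepancies: dict
    explanations: list = field(default_factory=list)


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def build_result(logits, discrepancies, threshold=0.5, flag_above=0.6):
    """Turn head logits and channel scores into a labelled, explainable result."""
    p_sarcastic = softmax(logits)[1]
    is_sarcastic = p_sarcastic >= threshold
    label = "SARCASTIC" if is_sarcastic else "NOT SARCASTIC"
    # Report confidence in the predicted class, whichever one it is.
    confidence = p_sarcastic if is_sarcastic else 1.0 - p_sarcastic
    # flag_above is an assumed cutoff for calling a channel "high".
    explanations = [
        f"High {name} discrepancy ({score:.2f})"
        for name, score in discrepancies.items()
        if score >= flag_above
    ]
    return SarcasmResult(label, confidence, dict(discrepancies), explanations)


result = build_result(
    logits=[-1.0, 1.3],  # illustrative logits, not model output
    discrepancies={"semantic": 0.72, "sentiment": 0.81, "fidelity": 0.92},
)
print(result.label, round(result.confidence, 2), result.explanations)
```

Returning the per-channel scores alongside the label lets a caller see which discrepancy drove the decision.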
Example 1: Social Media Sarcasm Detection Pipeline
User: "Build a sarcasm detection system for tweets that have images."
Approach:
transformers, sentence-transformers, and clip libraries
import torch
from sentence_transformers import SentenceTransformer
from transformers import pipeline

# Step 1: Load models
text_encoder = SentenceTransformer('all-MiniLM-L6-v2')
sentiment_analyzer = pipeline('sentiment-analysis', model='cardiffnlp/twitter-roberta-base-sentiment-latest')

# Step 2: Given an image caption (from MLLM) and original text
generated_caption = "A person sitting alone at a desk covered in paperwork in a gray office"
original_text = "Living my best life! #blessed #MondayMotivation"

# Step 3: Semantic discrepancy (cosine distance between the two embeddings)
emb_caption = text_encoder.encode(generated_caption)
emb_original = text_encoder.encode(original_text)
semantic_disc = 1 - torch.nn.functional.cosine_similarity(
    torch.tensor(emb_caption).unsqueeze(0),
    torch.tensor(emb_original).unsqueeze(0)
).item()  # e.g. 0.72 -- high divergence

# Step 4: Sentiment discrepancy
def signed_polarity(result):
    # Map the label to a signed score; 'neutral' counts as 0 rather than positive
    label = result['label'].lower()
    if 'neg' in label:
        return -result['score']
    if 'pos' in label:
        return result['score']
    return 0.0

sent_caption = sentiment_analyzer(generated_caption)[0]  # e.g. {'label': 'negative', 'score': 0.68}
sent_original = sentiment_analyzer(original_text)[0]  # e.g. {'label': 'positive', 'score': 0.94}
raw_sentiment_disc = abs(signed_polarity(sent_caption) - signed_polarity(sent_original))  # e.g. 1.62, range [0, 2]
sentiment_disc = raw_sentiment_disc / 2  # rescale to [0, 1] -> e.g. 0.81

# Step 5: Fidelity via NLI entailment (pseudocode; no NLI model is loaded in this snippet)
# nli_model.predict(premise=generated_caption, hypothesis=original_text)
# entailment_score = 0.08 -> fidelity_disc = 1 - 0.08 = 0.92

# Result: All three channels show high discrepancy -> sarcasm detected (confidence: 0.91)
Output:
Classification: SARCASTIC (confidence: 0.91)
Discrepancy breakdown:
Semantic: 0.72 (text meaning differs significantly from image content)
Sentiment: 0.81 (image conveys negativity, text conveys positivity)
Fidelity: 0.92 (text does not describe image content)
Generated anchor: "A person sitting alone at a desk covered in paperwork in a gray office"
Example 2: Non-Sarcastic Content Verification
User: "Check if this product review with photo is genuine or sarcastic."
Image description (generated): "A close-up of a leather wallet with multiple card slots, brown color, on a wooden table"
Original text: "Great quality leather wallet, fits all my cards perfectly. Highly recommend."
Discrepancy analysis:
Semantic: 0.15 (text aligns well with image content)
Sentiment: 0.08 (both mildly positive)
Fidelity: 0.12 (text accurately describes shown product)
Classification: NOT SARCASTIC (confidence: 0.94)
Example 3: Meme Incongruity Analysis
User: "Analyze this meme for sarcasm -- it shows a house on fire with the caption 'This is fine.'"
Approach:
Discrepancy analysis:
Semantic: 0.85 (extreme danger vs. casual acceptance)
Sentiment: 0.78 (disaster scene vs. calm reassurance)
Fidelity: 0.88 (text ignores the fire entirely)
Classification: SARCASTIC (confidence: 0.96)
Interpretation: The text deliberately understates the catastrophic visual content,
a classic sarcasm pattern where language contradicts observable reality.
twitter-roberta-base) when working with tweets or memes, since formal language models may misencode slang and hashtags.
Paper: GDCNet: Generative Discrepancy Comparison Network for Multimodal Sarcasm Detection (ICASSP 2026)
Key takeaway: Using MLLM-generated objective image descriptions as semantic anchors and computing three complementary discrepancy channels (semantic, sentiment, fidelity) outperforms direct cross-modal embedding comparison for sarcasm detection, achieving state-of-the-art on MMSD2.0.