GDCNet: Generative Discrepancy Comparison for Multimodal Sarcasm Detection

This skill enables Claude to build and apply multimodal sarcasm detection systems using the GDCNet approach. Instead of naively comparing image and text embeddings (which fails when content is loosely related or semantically indirect), GDCNet generates a factual, objective description of the image using a Multimodal LLM, then computes three targeted discrepancy channels -- semantic, sentiment, and visual-textual fidelity -- between that grounded description and the original text. A gated fusion module adaptively combines these signals with raw visual and textual features for classification. This technique generalizes beyond sarcasm to any task requiring detection of cross-modal contradictions or incongruity.

When to Use

When a user asks to detect sarcasm, irony, or satire in social media posts that contain both images and text
When building content moderation pipelines that need to flag misleading image-text pairs (e.g., a happy image paired with a bitter caption)
When the user wants to analyze whether an image and its caption semantically contradict each other
When implementing multimodal classifiers where direct cross-modal embedding alignment is unreliable because the image and text are only loosely connected
When the user needs to detect incongruity in memes, advertisements, or news content where the visual and textual messages diverge
When augmenting an existing text-only sarcasm detector with visual understanding

Key Technique

The core insight: Direct embedding comparison between images and text is brittle -- a photo of someone at an office desk paired with "What a wonderful Monday at work" may not show obvious embedding misalignment, because neither modality is overtly negative on its own. GDCNet addresses this by introducing a semantic anchor: an objective, factual description of the image generated by a Multimodal LLM (the paper uses LLaVA-NeXT). This caption -- e.g., "A person sitting at a desk in a dimly lit office cubicle" -- becomes the stable reference against which the original text is compared. The discrepancy between "dimly lit office cubicle" and "wonderful Monday" is easier to detect than a raw image-text embedding distance.

Three discrepancy channels capture different aspects of incongruity. (1) Semantic discrepancy measures how much the meaning of the generated caption diverges from the original text using encoded representations (e.g., BERT embeddings). (2) Sentiment discrepancy captures emotional contradictions -- the caption might describe a neutral or negative scene while the text expresses exaggerated positivity. (3) Visual-textual fidelity quantifies how faithfully the original text describes what is actually in the image, catching cases where the text simply ignores or misrepresents visual content.
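As a rough illustration of how the three channels can each reduce to a score in [0, 1], the sketch below works on precomputed inputs: two embedding vectors, two signed sentiment polarities, and an entailment probability. The function names, the clipping of cosine distance, and the halving of the polarity gap are assumptions of this sketch, not the paper's formulas.

```python
import math


def semantic_discrepancy(caption_emb, text_emb):
    """Cosine distance between two embedding vectors, clipped to [0, 1]."""
    dot = sum(c * t for c, t in zip(caption_emb, text_emb))
    norm = math.sqrt(sum(c * c for c in caption_emb)) * math.sqrt(sum(t * t for t in text_emb))
    if norm == 0.0:
        raise ValueError("embeddings must be non-zero")
    # 1 - cos lies in [0, 2]; clipping to [0, 1] is an assumption of this sketch.
    return min(1.0, max(0.0, 1.0 - dot / norm))


def sentiment_discrepancy(caption_polarity, text_polarity):
    """Absolute polarity gap; inputs in [-1, 1], result rescaled to [0, 1]."""
    return abs(caption_polarity - text_polarity) / 2.0


def fidelity_discrepancy(entailment_prob):
    """1 - P(caption entails text): higher means the text is less faithful to the image."""
    return 1.0 - entailment_prob


scores = {
    "semantic": semantic_discrepancy([1.0, 0.0], [0.0, 1.0]),  # orthogonal toy vectors
    "sentiment": sentiment_discrepancy(-0.68, 0.94),           # toy polarities
    "fidelity": fidelity_discrepancy(0.08),                    # toy entailment probability
}
print(scores)
```

Keeping every channel oriented so that a higher value means more incongruity makes the scores directly comparable before fusion.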

Gated fusion prevents any single channel from dominating. A learnable gating mechanism assigns dynamic weights to each discrepancy channel and the raw visual/textual features, allowing the model to rely more on sentiment discrepancy for emotionally charged sarcasm but shift to semantic discrepancy for factual contradictions. This is implemented as gate functions with learnable parameters that produce a unified representation fed to a classification head.
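A minimal sketch of such a gate, assuming the simple form g = sigmoid(W x + b) followed by an element-wise product with the input. It uses plain Python lists so it runs without dependencies; a real model would more likely use a learned linear layer in a deep learning framework, and the paper's exact gating architecture may differ. The helper names and the toy feature sizes are hypothetical.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def gated_fusion(x, W, b):
    """Return (gates, fused): gates = sigmoid(W x + b), fused = gates * x element-wise."""
    if len(W) != len(x) or len(b) != len(x):
        raise ValueError("W must be n x n and b length n, where n = len(x)")
    gates = [sigmoid(sum(w * xj for w, xj in zip(row, x)) + bi) for row, bi in zip(W, b)]
    fused = [g * xi for g, xi in zip(gates, x)]
    return gates, fused


# Toy input: three discrepancy scores followed by 2-dim visual and textual features.
d_sem, d_sent, d_fid = 0.72, 0.81, 0.92
v, t = [0.3, -0.1], [0.5, 0.2]
x = [d_sem, d_sent, d_fid] + v + t

n = len(x)
W = [[0.0] * n for _ in range(n)]  # untrained placeholder weights
b = [0.0] * n
gates, fused = gated_fusion(x, W, b)
```

With all-zero placeholder weights every gate is 0.5, so each input is simply halved; training would move the gates toward 0 or 1 per dimension depending on which signals help classification.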

Step-by-Step Workflow

Ingest the image-text pair. Load the image and its associated text (social media caption, meme text, article headline, etc.) as separate inputs. Validate that both modalities are present; fall back to text-only analysis if no image exists.
Generate an objective image description. Pass the image through a Multimodal LLM (LLaVA-NeXT, GPT-4V, or Claude's own vision) with a prompt like: "Describe the contents of this image in factual, objective terms. Do not interpret emotions, intent, or sarcasm. Focus on what is visually present: objects, people, actions, setting, colors, text overlays." This caption serves as the semantic anchor.
Encode all three text signals. Using a text encoder (BERT, RoBERTa, or sentence-transformers), produce embeddings for: (a) the original text, (b) the generated image description, and (c) optionally, any text extracted from the image via OCR. Store these as vectors for downstream comparison.
Compute semantic discrepancy. Calculate the cosine distance (or other divergence metric) between the embedding of the generated image description and the original text. High distance indicates the text says something semantically different from what the image depicts. Normalize this score to [0, 1].
Compute sentiment discrepancy. Run sentiment analysis on both the generated description and the original text (using a sentiment classifier or a simple positive/neutral/negative scorer). Compute the absolute difference in sentiment polarity scores. A positive description paired with negative text (or vice versa) yields a high sentiment discrepancy.
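Compute visual-textual fidelity. Measure how much the original text actually describes the image content. Use the CLIP similarity score between the image and the original text, or compute entailment probability between the generated caption and the original text using an NLI model. Low fidelity means the text ignores or contradicts what's in the image. To keep this channel aligned with the other two, report it as a discrepancy (for example, 1 minus the entailment probability), so that a higher value always means more incongruity.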
Extract raw modality features. Encode the image through a vision encoder (CLIP ViT, ResNet) and the text through the text encoder. These raw representations capture information beyond what the discrepancy channels measure.
Apply gated fusion. Concatenate the three discrepancy scores with the raw visual and textual feature vectors into a single vector x = [d_sem, d_sent, d_fid, v, t]. Pass x through a gating network, g = sigmoid(W x + b), then compute the fused representation as the element-wise product h = g * x. This lets the model learn which signals matter for each input.
Classify. Feed the fused representation through a classification head (linear layer + softmax) to produce sarcasm probability. Apply a threshold (default 0.5) for binary classification.
Return structured results. Output the classification label, confidence score, and the individual discrepancy scores so the user can interpret why the system flagged something as sarcastic (e.g., "High sentiment discrepancy: image shows a traffic jam but text says 'Love my commute!'").
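The last two steps can be sketched as follows, assuming a trained classification head has already produced two logits ordered as [not sarcastic, sarcastic]. The `SarcasmResult` type, the `build_result` helper, and the explanation wording are hypothetical names introduced for illustration.

```python
import math
from dataclasses import dataclass, asdict


@dataclass
class SarcasmResult:
    label: str
    confidence: float
    discrepancies: dict
    anchor: str
    explanation: str


def softmax(logits):
    """Softmax with max-subtraction for numerical stability."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def build_result(logits, discrepancies, anchor, threshold=0.5):
    """Turn head logits plus per-channel scores into an interpretable result."""
    p_sarcastic = softmax(logits)[1]
    is_sarcastic = p_sarcastic >= threshold
    top = max(discrepancies, key=discrepancies.get)  # channel with the largest score
    return SarcasmResult(
        label="SARCASTIC" if is_sarcastic else "NOT SARCASTIC",
        confidence=p_sarcastic if is_sarcastic else 1.0 - p_sarcastic,
        discrepancies=dict(discrepancies),
        anchor=anchor,
        explanation=f"Largest discrepancy channel: {top} ({discrepancies[top]:.2f})",
    )


result = build_result(
    logits=[0.0, 2.0],  # toy logits, not model output
    discrepancies={"semantic": 0.72, "sentiment": 0.81, "fidelity": 0.92},
    anchor="A person sitting alone at a desk covered in paperwork in a gray office",
)
print(asdict(result))
```

Returning the anchor and the per-channel scores alongside the label lets a reviewer check whether a prediction rests on a sound caption.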

Concrete Examples

Example 1: Social Media Sarcasm Detection Pipeline

User: "Build a sarcasm detection system for tweets that have images."

Approach:

Set up a Python pipeline with transformers, sentence-transformers, and clip libraries
For each tweet-image pair, generate an objective caption using a vision-language model
Compute the three discrepancy channels
Train a classifier on the MMSD2.0 dataset using the fused features

import torch
from sentence_transformers import SentenceTransformer
from transformers import pipeline

# Step 1: Load models
text_encoder = SentenceTransformer('all-MiniLM-L6-v2')
sentiment_analyzer = pipeline('sentiment-analysis', model='cardiffnlp/twitter-roberta-base-sentiment-latest')

# Step 2: Given an image caption (from MLLM) and original text
generated_caption = "A person sitting alone at a desk covered in paperwork in a gray office"
original_text = "Living my best life! #blessed #MondayMotivation"

# Step 3: Semantic discrepancy
emb_caption = text_encoder.encode(generated_caption)
emb_original = text_encoder.encode(original_text)
semantic_disc = 1 - torch.nn.functional.cosine_similarity(
    torch.tensor(emb_caption).unsqueeze(0),
    torch.tensor(emb_original).unsqueeze(0)
).item()  # e.g. 0.72 (illustrative value, not a measured output) -- high divergence

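# Step 4: Sentiment discrepancy
# Map each label to a signed polarity; 'neutral' contributes 0 instead of being counted as positive.
sign = {'negative': -1.0, 'neutral': 0.0, 'positive': 1.0}
sent_caption = sentiment_analyzer(generated_caption)[0]   # e.g. {'label': 'negative', 'score': 0.68} (illustrative)
sent_original = sentiment_analyzer(original_text)[0]      # e.g. {'label': 'positive', 'score': 0.94} (illustrative)
polarity_caption = sign.get(sent_caption['label'].lower(), 0.0) * sent_caption['score']
polarity_original = sign.get(sent_original['label'].lower(), 0.0) * sent_original['score']
sentiment_disc = abs(polarity_caption - polarity_original) / 2  # raw gap lies in [0, 2]; halve to get [0, 1]
# With the illustrative values above: |-0.68 - 0.94| / 2 = 0.81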

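# Step 5: Fidelity via NLI entailment
# Assumption: 'roberta-large-mnli' stands in for the unspecified NLI model; any NLI checkpoint with an entailment label works.
from transformers import AutoTokenizer, AutoModelForSequenceClassification
nli_tokenizer = AutoTokenizer.from_pretrained('roberta-large-mnli')
nli_model = AutoModelForSequenceClassification.from_pretrained('roberta-large-mnli')
nli_inputs = nli_tokenizer(generated_caption, original_text, return_tensors='pt', truncation=True)  # premise, hypothesis
with torch.no_grad():
    nli_probs = torch.softmax(nli_model(**nli_inputs).logits, dim=-1)[0]
label_to_id = {label.lower(): idx for idx, label in nli_model.config.id2label.items()}
entailment_score = nli_probs[label_to_id['entailment']].item()  # e.g. 0.08 (illustrative)
fidelity_disc = 1 - entailment_score                            # e.g. 1 - 0.08 = 0.92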

# Result: All three channels show high discrepancy -> sarcasm detected (confidence: 0.91)

Output:

Classification: SARCASTIC (confidence: 0.91)
Discrepancy breakdown:
  Semantic:  0.72 (text meaning differs significantly from image content)
  Sentiment: 0.81 (image conveys negativity, text conveys positivity)
  Fidelity:  0.92 (text does not describe image content)
Generated anchor: "A person sitting alone at a desk covered in paperwork in a gray office"

Example 2: Non-Sarcastic Content Verification

User: "Check if this product review with photo is genuine or sarcastic."

Image description (generated): "A close-up of a leather wallet with multiple card slots, brown color, on a wooden table"
Original text: "Great quality leather wallet, fits all my cards perfectly. Highly recommend."

Discrepancy analysis:
  Semantic:  0.15 (text aligns well with image content)
  Sentiment: 0.08 (both mildly positive)
  Fidelity:  0.12 (text accurately describes shown product)

Classification: NOT SARCASTIC (confidence: 0.94)

Example 3: Meme Incongruity Analysis

User: "Analyze this meme for sarcasm -- it shows a house on fire with the caption 'This is fine.'"

Approach:

Generate anchor: "A cartoon dog sitting at a table inside a room engulfed in flames"
Compare against original text: "This is fine."

Discrepancy analysis:
  Semantic:  0.85 (extreme danger vs. casual acceptance)
  Sentiment: 0.78 (disaster scene vs. calm reassurance)
  Fidelity:  0.88 (text ignores the fire entirely)

Classification: SARCASTIC (confidence: 0.96)
Interpretation: The text deliberately understates the catastrophic visual content,
a classic sarcasm pattern where language contradicts observable reality.

Best Practices

Do use a factual, constrained prompt for image captioning. Instruct the MLLM to describe only observable facts -- "Describe objects, people, actions, and setting. Do not interpret mood or intent." Unconstrained captions introduce the same noise GDCNet is designed to avoid.
Do normalize all discrepancy scores to a consistent range (e.g., [0, 1]) before fusion, since raw cosine distances, sentiment differences, and entailment scores operate on different scales.
Do include the generated caption in your output so users can verify the anchor quality. A bad caption (hallucinated objects, missed key elements) will corrupt all three discrepancy channels.
Do use established text encoders fine-tuned on social media or informal text (e.g., twitter-roberta-base) when working with tweets or memes, since formal language models may misencode slang and hashtags.
Avoid relying on a single discrepancy channel. Sarcasm manifests differently -- some is purely sentiment-based ("Love this weather" over a storm), some is semantic ("My favorite place" over a prison), and some is about omission (text ignores the obvious). The gated fusion exists to handle this diversity.
Avoid using the same MLLM for both caption generation and sarcasm classification. The anchor must be independent and objective; if the model already "knows" the text is sarcastic, its caption may be contaminated with interpretive language.

Error Handling

Poor image caption quality: If the MLLM hallucinates objects or misses key visual elements, all discrepancy scores become unreliable. Mitigate by using a high-quality vision model, or generate multiple captions and take the consensus. Validate captions by checking CLIP similarity between the generated text and the original image.
Ambiguous sentiment: Neutral or mixed-sentiment text yields low sentiment discrepancy even when sarcasm is present. Fall back to semantic and fidelity channels. Consider using a sarcasm-aware sentiment model rather than a generic one.
OCR text in images: Memes and social media images often contain overlaid text. Extract this via OCR and include it as an additional input alongside the generated caption. Ignoring in-image text causes false negatives on text-heavy memes.
Domain shift: A model trained on English tweets may fail on formal news articles or non-English content. Retrain or fine-tune the discrepancy encoders on domain-matched data. At minimum, swap the text encoder for one suited to the target domain.
Short or generic text: Inputs like "Wow" or "Nice" have minimal semantic content, making discrepancy computation noisy. Flag these as low-confidence predictions and consider requiring a minimum text length for reliable analysis.
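Several of the checks above can run before any model is called. The sketch below is one hedged way to collect them as warning flags; the function name, the three-token minimum, and the 0.2 neutral band are assumptions chosen for illustration, not values from the paper.

```python
def reliability_flags(text, has_image, text_polarity, min_tokens=3, neutral_band=0.2):
    """Return a list of warnings that should lower trust in a prediction.

    text_polarity is a signed sentiment score in [-1, 1] for the original text.
    """
    flags = []
    if not has_image:
        flags.append("no_image: fall back to a text-only sarcasm model")
    if len(text.split()) < min_tokens:
        flags.append("short_text: too little semantic content, mark as low confidence")
    if abs(text_polarity) < neutral_band:
        flags.append("ambiguous_sentiment: rely on semantic and fidelity channels")
    return flags


print(reliability_flags("Wow", has_image=True, text_polarity=0.9))
print(reliability_flags("Love my commute every single day", has_image=True, text_polarity=0.95))
```

Attaching these flags to the output, rather than silently dropping the input, keeps the low-confidence cases visible to whoever reviews the predictions.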

Limitations

Requires both modalities: GDCNet is fundamentally a multimodal approach. It cannot detect sarcasm in text-only content; use a dedicated text sarcasm model for that.
Caption bottleneck: The entire pipeline depends on the quality of the MLLM-generated image description. Failure modes in the vision model propagate to all three discrepancy channels.
Cultural and contextual sarcasm: Sarcasm that depends on shared cultural knowledge (e.g., referencing a specific event or inside joke) may not be captured by discrepancy alone, since the image and text may appear consistent at a surface level.
Computational cost: Running an MLLM for caption generation, a text encoder, a sentiment model, and a CLIP/NLI model per sample is resource-intensive. The pipeline is not suitable for real-time, high-throughput applications without optimization (batching, model distillation, caching).
Positive sarcasm blind spot: When sarcasm involves understating something good (e.g., "Not bad" over an amazing view), sentiment discrepancy is low and the approach may miss it. This is an inherent challenge in discrepancy-based methods.
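Of the optimizations mentioned for computational cost, caching is the simplest to sketch: reposted images are common on social media, so the caption step can be skipped for bytes already seen. The `CaptionCache` class and the `caption_fn` callable are hypothetical; `caption_fn` stands for whatever function wraps the MLLM call.

```python
import hashlib


class CaptionCache:
    """In-memory cache of generated captions keyed by the SHA-256 of the image bytes."""

    def __init__(self, caption_fn):
        self._caption_fn = caption_fn  # hypothetical wrapper around the MLLM captioner
        self._store = {}
        self.misses = 0

    def get(self, image_bytes: bytes) -> str:
        key = hashlib.sha256(image_bytes).hexdigest()
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._caption_fn(image_bytes)
        return self._store[key]


def fake_captioner(image_bytes: bytes) -> str:
    # Stand-in for an expensive MLLM call, used only to make the sketch runnable.
    return f"caption for {len(image_bytes)} bytes"


cache = CaptionCache(fake_captioner)
first = cache.get(b"same-image")
second = cache.get(b"same-image")   # served from the cache, no second captioner call
other = cache.get(b"another-image")
print(cache.misses)
```

Hashing the raw bytes only catches exact duplicates; re-encoded or resized copies of the same image would still trigger a new caption.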

Reference

Paper: GDCNet: Generative Discrepancy Comparison Network for Multimodal Sarcasm Detection (ICASSP 2026)
Key takeaway: Using MLLM-generated objective image descriptions as semantic anchors and computing three complementary discrepancy channels (semantic, sentiment, fidelity) outperforms direct cross-modal embedding comparison for sarcasm detection, achieving state-of-the-art on MMSD2.0.
