MedObvious

Why this benchmark matters

Before a model diagnoses, it should know what it is looking at.

In clinical practice, interpretation begins with input verification: modality, anatomy, viewpoint, and basic image integrity must be correct before reasoning about pathology. MedObvious measures this pre-diagnostic gatekeeping ability directly.

This is especially important for multi-image and agentic workflows, where models operate over multi-view ultrasound, CT/MRI series, or viewer layouts. A single inconsistent panel can invalidate downstream reasoning.

🏥 Clinical safety

Wrong body parts, flipped images, anatomy/viewpoint mismatches, and device-related cues should all be caught before any diagnosis.

🔬 Visual grounding

Synthetic inconsistencies test whether decisions are anchored in visual evidence rather than language priors or report-style completion.

⚠️ Negative controls

705 tasks contain no outlier, directly measuring false alarm rates when all panels are internally consistent.

Interactive Teaser

A guided walkthrough of MedObvious

Motivation

Are Medical VLMs Truly Reasoning?

Current vision-language models (VLMs) boast impressive medical QA scores. However, they may be memorizing textbook patterns rather than performing genuine diagnostic deduction.

The Core Problem

Standard Benchmarks Fail to Test Visual Triage

Existing benchmarks test factual retrieval but do not test active comparative reasoning or penalize severe hallucinations on clean inputs.

1. Text Bias

Models read text overlays instead of analyzing the actual visual content of the scan.

2. Rote Memorization

Models guess based on statistical probability rather than examining the image systematically.

Our Solution

The Clinical Odd-One-Out Test

We force VLMs to systematically analyze dense grids of clinical scans and identify the single anomalous outlier — testing pre-diagnostic visual sanity.

Easy Case

Modality Mismatch (2×2)

A 2×2 grid with three CT scans and one MRI scan at bottom-left. This should be trivial for any expert, yet two of the three models shown here still fail.

Qwen2.5-VL ✓ "C (bottom-left)"

LLaVA-1.5 ✗ "A (top-left)"

Pixtral-12B ✗ "D (bottom-right)"

Hard Case

Dense 3×3 Grid

Eight CT scans plus one MRI at position F (center-right). Scaling to nine panels trips up all three models shown here.

Qwen2.5-VL ✗ "D (center-left)"

Qwen3-VL ✗ "A (top-left)"

Pixtral-12B ✗ "A (top-left)"

Clinical Triage

Hardware / Device Detection

One scan, at top-right, contains surgical hardware; the others are clean chest X-rays. All three models shown here get this wrong.

Qwen2.5-VL ✗ "C (bottom-left)"

Qwen3-VL ✗ "C (bottom-left)"

LLaVA-1.5 ✗ "A (top-left)"

Key Finding

Severe Hallucinations on Negative Controls

When given a grid with no outlier present, most VLMs confidently hallucinate one. Human experts score 95.7% on the same negative controls.

Best VLM on negatives

70.7%

Qwen2.5-VL-7B

Benchmark Design

Set-level consistency as a controlled testbed

MedObvious abstracts multi-view clinical workflows into small grids where the model must identify an outlier — or correctly state that none exists.
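
As a rough illustration of the answer space, the sketch below labels the panels of a grid with letters in row-major order and appends one extra letter for "no outlier". This matches the labels quoted on this page (C = bottom-left in a 2×2 grid; F = center-right and J = None in a 3×3 grid), but the function name and the exact position wording are assumptions, not the benchmark's released code.

```python
import string


def panel_labels(rows: int, cols: int) -> dict[str, str]:
    """Map answer letters to panel positions (row-major), plus a final 'None' option."""
    row_names = {2: ["top", "bottom"], 3: ["top", "center", "bottom"]}[rows]
    col_names = {2: ["left", "right"], 3: ["left", "center", "right"]}[cols]
    letters = iter(string.ascii_uppercase)
    labels = {}
    for r in row_names:
        for c in col_names:
            # The middle cell of a 3x3 grid is called "center", not "center-center".
            labels[next(letters)] = "center" if r == c == "center" else f"{r}-{c}"
    # Assumption: the letter after the last panel means "no outlier".
    labels[next(letters)] = "None"
    return labels


print(panel_labels(2, 2))
print(panel_labels(3, 3))
```

Under this scheme a 2×2 task has options A–E and a 3×3 task has options A–J, which agrees with the option ranges that appear in the example queries.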

01. Five progressive tiers

From basic modality mismatches (T1) to high-saliency clinical findings such as surgical hardware, fractures, and gross pathology (T5).

02. Five evaluation protocols

Detection MCQ/Open, Referring MCQ/Open, and Visual Referring test localization, description, and verification.

03. Explicit negative controls

705 tasks contain no outlier, directly measuring false alarm rates that are critical for safe deployment.

Five Tiers

Progressively harder sanity-checking tasks

Each tier introduces new complexity. Real examples from the benchmark are shown alongside each tier description.

T1: Foundation

Basic modality mismatches in 2×2 grids. E.g., one MRI scan among CT scans.

440 tasks (275 positive + 165 negative)

T2: Diversity

Broader modality pool with finer intra-class appearance variability in 2×2 grids.

480 tasks (300 positive + 180 negative)

T3: Scaling

Dense 3×3 grids with 8 distractors. Systematic comparison is essential.

360 tasks (225 positive + 135 negative)

T4: Semantics

Anatomy and viewpoint mismatches. E.g., one abdomen CT among chest X-rays.

320 tasks (200 positive + 120 negative)

T5: Triage

High-saliency clinical failures: surgical hardware, fractures, gross pathology.

280 tasks (175 positive + 105 negative)
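
The per-tier counts can be cross-checked against the 705 negative controls quoted earlier. The short script below simply sums the numbers listed above; the dictionary layout is an illustrative choice, not a file format shipped with the benchmark.

```python
# (positive, negative) task counts per tier, copied from the tier descriptions above.
TIERS = {
    "T1": (275, 165),
    "T2": (300, 180),
    "T3": (225, 135),
    "T4": (200, 120),
    "T5": (175, 105),
}

positives = sum(p for p, _ in TIERS.values())
negatives = sum(n for _, n in TIERS.values())
total = positives + negatives

print(positives, negatives, total)  # 1175 705 1880
for tier, (p, n) in TIERS.items():
    print(f"{tier}: {n / (p + n):.1%} negative")  # 37.5% for every tier
```

The totals come to 1,175 positive and 705 negative tasks (1,880 overall), and every tier keeps the same 37.5% share of negative controls.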

Results

Current VLMs remain unreliable gatekeepers

Across 17 VLMs spanning general, medical, and proprietary families, performance is strikingly uneven.

Best open-source: 63.2%

Qwen2.5-VL-7B

Best medical: 56.6%

Lingshu-7B

Best proprietary: 55.5%

Gemini-2.0-Flash

Human expert: 88.4%

Across all tasks

Tasks A–E correspond to Detection MCQ, Detection Open, Referring MCQ, Referring Open, and Visual Referring. Pos(+)/Neg(−) = accuracy on positive/negative samples.

Model	Task-A	Task-B	Task-C	Task-D	Task-E	Pos(+)	Neg(−)	Avg
General Open-Source VLMs
LLaVA-1.5-7B	40.4	22.3	22.1	35.7	50.0	37.5	31.9	34.3
Qwen2-VL-7B	32.3	49.5	72.7	25.1	53.1	56.3	28.7	45.4
Qwen2.5-VL-7B	58.7	82.1	75.3	29.7	65.1	60.9	70.7	63.2
Qwen3-VL-8B	31.2	32.7	80.8	38.7	52.9	68.9	2.9	44.0
InternVL2.5-8B	56.3	56.3	69.7	26.8	51.7	59.7	42.5	51.9
InternVL3-8B	38.5	43.6	80.4	27.6	50.2	64.6	16.6	45.9
Pixtral-12B	31.0	22.5	76.6	26.3	51.9	59.5	5.3	39.0
Medical Open-Source VLMs
LLaVA-Med-7B	10.0	36.8	21.2	23.4	50.0	37.1	17.5	28.0
Fleming-8B	26.8	23.8	78.3	23.8	50.0	57.4	5.3	37.9
MedGemma1.5-4B-IT	23.6	86.1	43.4	19.5	66.6	44.6	64.1	49.7
Lingshu-7B	39.3	78.5	79.5	26.8	61.9	66.8	43.8	56.6
Proprietary VLMs
Gemini-2.0-Flash	54.2	42.7	85.9	35.7	69.3	75.4	25.6	55.5
Gemini-2.5-Flash	67.2	45.5	80.4	31.9	55.9	74.1	26.3	54.4
GPT-4o	47.2	50.4	62.8	26.3	61.7	68.0	22.7	48.4
GPT-4.1-nano	25.3	16.8	26.3	18.3	55.1	34.9	21.4	28.3
GPT-4.1-mini	41.9	32.9	53.1	29.3	64.2	64.0	13.6	42.7
GPT-5-nano	43.4	41.7	82.5	28.9	63.4	73.1	14.8	49.6
Human expert	82.1	85.7	82.1	90.9	92.9	89.4	95.7	88.4
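
One way to read the Pos(+) and Neg(−) columns is as accuracy computed separately over tasks that contain an outlier and tasks that do not. The sketch below assumes a hypothetical record format (`gold` and `pred` answers, with `"None"` marking a negative control); it is not the benchmark's official scoring script.

```python
def split_accuracy(records):
    """Return (positive_accuracy, negative_accuracy) in percent.

    Each record is a dict with 'gold' and 'pred' answers; gold == "None"
    marks a negative control (hypothetical format, assumed for illustration).
    """
    pos = [r for r in records if r["gold"] != "None"]
    neg = [r for r in records if r["gold"] == "None"]

    def acc(rows):
        if not rows:
            return float("nan")
        return 100.0 * sum(r["pred"] == r["gold"] for r in rows) / len(rows)

    return acc(pos), acc(neg)


records = [
    {"gold": "C", "pred": "C"},        # outlier found
    {"gold": "D", "pred": "A"},        # wrong panel
    {"gold": "None", "pred": "B"},     # false alarm on a clean grid
    {"gold": "None", "pred": "None"},  # correct restraint
]
pos_acc, neg_acc = split_accuracy(records)
print(pos_acc, neg_acc)  # 50.0 50.0
```

Under this reading, the false alarm rate on clean grids is 100 minus the negative accuracy, so a Neg(−) score of 2.9 corresponds to flagging a nonexistent outlier on about 97% of negative controls.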

Per-tier overall accuracy (%). Tiers increase in difficulty from T1 (foundation) to T5 (triage).

Model	T1	T2	T3	T4	T5	All
General Open-Source VLMs
LLaVA-1.5-7B	35.2	40.4	21.6	42.5	36.7	35.2
Qwen2-VL-7B	50.9	47.2	37.2	52.8	39.6	45.5
Qwen2.5-VL-7B	68.8	67.2	50.5	84.0	49.2	63.9
Qwen3-VL-8B	47.2	48.9	34.7	53.4	32.8	43.4
InternVL2.5-8B	58.8	56.0	33.0	67.2	49.3	52.8
InternVL3-8B	50.4	50.4	37.2	57.1	33.9	45.8
Pixtral-12B	45.0	41.4	26.3	43.9	33.8	38.7
Medical Open-Source VLMs
LLaVA-Med-7B	32.7	32.5	28.8	26.8	25.0	29.1
Fleming-8B	41.8	43.9	29.1	39.0	31.4	37.0
MedGemma1.5-4B-IT	59.7	53.7	37.2	53.7	53.5	51.5
Lingshu-7B	62.9	64.7	48.8	63.1	46.0	57.1
Proprietary VLMs
Gemini-2.0-Flash	59.3	61.0	56.3	66.8	34.6	55.6
Gemini-2.5-Flash	59.0	62.7	55.2	62.1	35.0	54.8
GPT-4o	54.3	56.0	45.2	59.6	34.6	49.9
GPT-4.1-nano	29.3	31.8	20.2	21.8	37.5	30.1
GPT-4.1-mini	47.7	52.2	36.1	50.9	33.5	44.1
GPT-5-nano	52.5	56.6	46.6	61.2	33.2	50.0

🚨 Catastrophic negative-control failure

When no outlier exists, most VLMs hallucinate one. Qwen3-VL-8B: 2.9% on negatives. Humans: 95.7%.

📉 T3 is the universal breaking point

Scaling from 2×2 to 3×3 causes steep drops. Qwen2.5-VL drops from 68.8% (T1) to 50.5% (T3).

🔬 Medical fine-tuning ≠ safety

LLaVA-Med-7B averages only 28.0%, worse than several general-purpose models.

🎯 MCQ inflates capability

Gemini-2.0-Flash: Task-C 85.9% vs Task-D 35.7% — a 50-point gap between MCQ and open formats.
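
The 50-point figure is the difference between two columns of the results table. As a quick check, the sketch below recomputes the Task-C minus Task-D gap for a few models, using values copied from the table; the variable names are illustrative only.

```python
# Task-C (Referring MCQ) and Task-D (Referring Open) accuracies (%),
# copied from the results table on this page.
REFERRING = {
    "Gemini-2.0-Flash": (85.9, 35.7),
    "Qwen3-VL-8B": (80.8, 38.7),
    "Lingshu-7B": (79.5, 26.8),
    "LLaVA-1.5-7B": (22.1, 35.7),
}

# Positive gap: the model scores higher with multiple-choice options.
gaps = {model: round(mcq - open_, 1) for model, (mcq, open_) in REFERRING.items()}

for model, gap in sorted(gaps.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{model}: {gap:+.1f} points")
```

For these four rows the gap ranges from −13.6 points (LLaVA-1.5-7B, which does better in the open format) to +52.7 points (Lingshu-7B), so the inflation is large for most models but not uniform.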

✅ Scale advantages are modest

GPT-5-nano (49.6%) slightly exceeds GPT-4o (48.4%).

⚠️ T5 triage remains unsolved

Even the best model on T5 (MedGemma: 53.5%) fails nearly half the time on triage tasks.

Qualitative Examples

Side-by-side: query grids and model predictions

Each example shows the input grid with its protocol-specific query alongside predictions from representative models. Visual Referring examples use images with a red bounding box highlighting a specific panel.

Detection MCQ

Query: "You are reviewing a 2×2 grid of medical scans. One scan is the clinical outlier — does it differ in modality, anatomy, or pathology from the others? Which position contains the outlier? (A–E)"
Ground truth: C (bottom-left) — MRI scan among CT scans.

Qwen2.5-VL-7B

✓ Correct Predicted: C (bottom-left)

Correctly identifies the MRI scan as the modality outlier.

LLaVA-1.5-7B

✗ Wrong Predicted: A (top-left)

Exhibits strong position bias — always defaults to A regardless of content.

Pixtral-12B

✗ Wrong Predicted: D (bottom-right)

Picks the wrong quadrant, although it lands in the correct (bottom) row.

Detection Open

Query: "You are a medical imaging expert reviewing a 3×3 grid. Identify the outlier position — does it differ in modality, anatomy, or pathology from the others?"
Ground truth: The MRI scan among 8 CT scans. Explanation: different imaging modality.

Qwen2.5-VL-7B

✗ Wrong Points to incorrect position

The dense 3×3 layout overwhelms systematic comparison — picks wrong column.

Qwen3-VL-8B

✗ Wrong Points to top-left

Falls back to default position under uncertainty in the dense grid.

Pixtral-12B

✗ Wrong Points to top-left

3×3 grids universally expose shallow pattern matching.

Referring MCQ

Query: "In this 2×2 grid, one scan contains a clinical anomaly (abdomen CT scan) — does it differ in modality, anatomy, or pathology from the others? Which position is it? (A–E)"
Ground truth: D (bottom-right) — abdomen CT among chest X-rays.

Qwen2.5-VL-7B

✓ Correct Predicted: D

Referring MCQ gives a text hint about the anomaly type, making it easier to verify.

Qwen3-VL-8B

✓ Correct Predicted: D

Cross-anatomy mismatches are easier when the anomaly type is described.

Pixtral-12B

✓ Correct Predicted: D

Referring MCQ typically achieves higher accuracy than detection protocols.

Referring Open

Query: "In this 2×2 grid, one image contains a clinical anomaly — does it differ in modality, anatomy, or pathology from the others? State its position."
Ground truth: Top-left — ultrasound scan among MRI scans.

InternVL2.5-8B

✗ Wrong Incorrectly describes the anomaly

Open-ended generation reveals weaker grounding than constrained MCQ options.

LLaVA-Med-7B

✗ Wrong Points to wrong position

Medical fine-tuning does not help — position bias dominates in open format.

GPT-4.1-nano

✗ Wrong Confuses modality labels

Smallest proprietary model struggles with open-ended clinical reasoning.

Visual Referring (Positive)

Query: "The red box highlights one scan. Is this highlighted scan the clinical outlier — does it differ in modality, anatomy, or pathology from the others? Answer yes or no."
Ground truth: Yes — the highlighted scan is an MRI among CT scans.

Qwen2.5-VL-7B

✓ Correct "Yes, the highlighted scan is MRI."

When pointed directly at the outlier, the model can confirm differing modality.

LLaVA-1.5-7B

✗ Wrong "No, all images appear consistent."

Fails to recognize the modality difference even when the outlier is highlighted.

Fleming-8B

✗ Wrong "No"

Medical VLM incorrectly denies an obvious modality mismatch.
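
Free-text replies such as "Yes, the highlighted scan is MRI." and "No" have to be reduced to a yes/no decision before they can be scored. The sketch below shows one simple way to do that; the function names and the policy of counting unparseable replies as wrong are assumptions, not the benchmark's documented procedure.

```python
import re


def parse_yes_no(response: str):
    """Return True for 'yes', False for 'no', None if the reply starts with neither."""
    cleaned = response.strip().lstrip("\"'")
    match = re.match(r"(yes|no)\b", cleaned, flags=re.IGNORECASE)
    if match is None:
        return None
    return match.group(1).lower() == "yes"


def is_correct(response: str, gold_is_outlier: bool) -> bool:
    """Score a Visual Referring reply; unparseable replies count as wrong (assumed policy)."""
    return parse_yes_no(response) == gold_is_outlier


# Replies quoted in the example above, where the highlighted scan is the outlier.
for reply in ["Yes, the highlighted scan is MRI.", "No, all images appear consistent.", "No"]:
    print(repr(reply), is_correct(reply, gold_is_outlier=True))
```

Matching only the leading word keeps the parser from being fooled by a "no" that appears later in a longer explanation.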

Visual Referring (Negative)

Query: "The red box highlights one scan. Is this highlighted scan the clinical outlier — does it differ in modality, anatomy, or pathology from the others? Answer yes or no."
Ground truth: No. The highlighted scan is not the outlier; all images show MRI.

Qwen3-VL-8B

✗ Hallucinated "Yes, it differs."

Hallucinates an anomaly in the highlighted panel despite it being consistent.

Pixtral-12B

✗ Hallucinated "Yes"

"Always-find-something" bias is triggered by the red box visual cue.

MedGemma1.5-4B-IT

✓ Correct "No, the highlighted scan is consistent."

Best negative control accuracy among medical VLMs — capable of restraint.

Visual Referring (3×3)

Query: "The red box in this 3×3 grid highlights one scan. Is this highlighted scan the clinical outlier — does it differ in modality, anatomy, or pathology from the others?"
Ground truth: Yes — the highlighted scan is an MRI among CT scans.

InternVL3-8B

✗ Wrong "No, the highlighted image is consistent."

Dense 3×3 grids make it harder to compare even when the outlier is highlighted.

GPT-4.1-mini

✗ Wrong "No"

Smaller proprietary models struggle with visual referring on larger grids.

GPT-5-nano

✗ Wrong "No, appears consistent."

Even the latest reasoning models fail on 3×3 visual referring tasks.

Visual Referring (Negative)

Query: "The red box in this 3×3 medical image grid highlights one scan. Is this highlighted scan the clinical outlier — does it differ in modality, anatomy, or pathology from the others? Answer yes or no."
Ground truth: No. All images show CT scans; no outlier is present.

Qwen3-VL-8B

✗ Hallucinated "Yes"

Incorrectly flags a perfectly consistent 3×3 negative control as an outlier.

Pixtral-12B

✗ Hallucinated "Yes. The highlighted scan in the red box is an outlier as it shows a chest X-ray with a visible abnormality..."

Hallucinates a detailed physical abnormality despite the uniform grid.

LLaVA-1.5-7B

✗ Hallucinated "Yes"

Shows strong "always-find-something" bias triggered by the bounding box.

Negative Control (3×3)

Query: "If one scan is the outlier — does it differ in modality, anatomy, or pathology from the others? — identify it. If all scans appear consistent, select 'None'. (A–J)"
Ground truth: J (None) — all images show MRI scans. No outlier exists.

Gemini-2.0-Flash

✗ Hallucinated Picks a panel as outlier

Fabricates a difference in a perfectly consistent 3×3 grid.

GPT-4o

✗ Hallucinated Picks a panel as outlier

Generates plausible-sounding but entirely fabricated clinical reasoning.

Qwen2.5-VL-7B

✓ Correct Predicted: J (None)

One of few models capable of correctly suppressing false alarms on 3×3 grids.

Visual Referring (Positive)

Query: "The red box highlights one scan. Is this the clinical outlier — does it differ in modality, anatomy, or pathology from the others?"
Ground truth: Yes — the highlighted scan is an abdomen CT among chest X-rays (anatomy mismatch).

Gemini-2.5-Flash

✓ Correct "Yes, this is an abdominal scan."

Proprietary models perform best on visual referring with clear anatomy differences.

Qwen2.5-VL-7B

✓ Correct "Yes"

Cross-anatomy mismatches are well handled when the panel is directly highlighted.

LLaVA-Med-7B

✗ Wrong "No"

Medical fine-tuning on LLaVA yields the lowest average accuracy (28.0%) of any model evaluated.

Resources

arXiv, Data & Code

📄 arXiv Pre-print

Full benchmark description, methodology, and evaluation results.

🎬 Teaser Video

Download the short animated teaser walkthrough.

Citation

@article{medobvious2026,
  title   = {MedObvious: Exposing the Medical Moravec's Paradox in VLMs via Clinical Triage},
  author  = {Ufaq Khan and Umair Nawaz and L D M S S Teja and Numaan Saeed and Muhammad Bilal and Yutong Xie and Mohammad Yaqub and Muhammad Haris Khan},
  journal = {arXiv preprint arXiv:2603.22286},
  year    = {2026}
}

MedObvious

🚀 Benchmark Will Be Released Soon!

"Which image does not belong
among the candidates?"
