3D Medical Vision-Language Enhancement
Abstract
Interpreting volumetric CT with vision-language models (VLMs) demands aligning long-range spatio-temporal evidence with radiology text under tight memory budgets. In this setting, Med3DVLM, which couples a 3D vision encoder to a 7B-parameter decoder, reports 79.95% closed-ended accuracy and 36.76 METEOR on M3D.
Yet contemporary VLM attention often diffuses, lighting up many non-diagnostic regions rather than the truly salient ones. We propose slice-wise visual-instruction prompting: on every axial slice of the 3D volume, a thin (sub-voxel-width) colored contour traces the anatomy referenced by the question, turning the image itself into a focus cue.
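The contour-prompting idea can be sketched as a simple overlay step. The snippet below is a minimal illustration, not the paper's implementation: it assumes a binary segmentation mask of the referenced anatomy is available, and it draws a 1-voxel-wide contour for simplicity (the abstract describes sub-voxel-thin contours, which would require rendering at higher resolution).

```python
import numpy as np

def overlay_contours(volume, mask, color=(255, 0, 0)):
    """Slice-wise visual prompting sketch.

    On each axial slice, paint a 1-voxel-wide colored contour around the
    masked anatomy (hypothetical helper; not the paper's actual code).

    volume: (D, H, W) uint8 grayscale CT slices
    mask:   (D, H, W) bool, True inside the referenced anatomy
    """
    rgb = np.repeat(volume[..., None], 3, axis=-1).astype(np.uint8)
    for z in range(mask.shape[0]):
        m = mask[z]
        if not m.any():
            continue
        # A voxel is interior if all four in-plane neighbours are also masked.
        interior = m.copy()
        interior[1:, :] &= m[:-1, :]
        interior[:-1, :] &= m[1:, :]
        interior[:, 1:] &= m[:, :-1]
        interior[:, :-1] &= m[:, 1:]
        # The contour is the masked boundary: masked but not interior.
        edge = m & ~interior
        rgb[z][edge] = color
    return rgb
```

In practice the overlaid RGB slices, rather than the raw grayscale ones, would be fed to the VLM, so the focus cue travels through the standard image pathway with no architectural change.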
On RadGenome-ChestCT and PMC-VQA, Qwen variants (0.5B/1.5B/3B) with these prompts perform on par with a prompt-free Qwen-7B while cutting GPU memory use. Moreover, prompt-guided fine-tuning further lifts closed-ended accuracy and improves open-ended VQA as measured by BLEU-4, ROUGE-L, and METEOR.