3D Medical Vision-Language Enhancement
Abstract
Interpreting volumetric CT with vision-language models (VLMs) demands aligning long-range spatio-temporal evidence with radiology text under tight memory budgets. In this setting, Med3DVLM, which couples a 3D vision encoder to a 7B-parameter decoder, reports 79.95% closed-ended accuracy and 36.76 METEOR on M3D.
Yet contemporary VLM attention often diffuses, lighting up many non-diagnostic regions rather than the truly salient ones. We propose slice-wise visual-instruction prompting: on every axial slice of the 3D volume, a thin (sub-voxel-width) colored contour traces the anatomy referenced by the question, turning the image itself into a focus cue.
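The contour-prompting idea can be sketched as a simple overlay step. The snippet below is a minimal illustration, not the paper's implementation: it assumes a binary segmentation mask of the referenced anatomy is available, and it draws a 1-voxel-wide contour for simplicity (the abstract describes sub-voxel-thin contours, which would require rendering at higher resolution).

```python
import numpy as np

def overlay_contours(volume, mask, color=(255, 0, 0)):
    """Slice-wise visual prompting sketch.

    On each axial slice, paint a 1-voxel-wide colored contour around the
    masked anatomy (hypothetical helper; not the paper's actual code).

    volume: (D, H, W) uint8 grayscale CT slices
    mask:   (D, H, W) bool, True inside the referenced anatomy
    """
    rgb = np.repeat(volume[..., None], 3, axis=-1).astype(np.uint8)
    for z in range(mask.shape[0]):
        m = mask[z]
        if not m.any():
            continue
        # A voxel is interior if all four in-plane neighbours are also masked.
        interior = m.copy()
        interior[1:, :] &= m[:-1, :]
        interior[:-1, :] &= m[1:, :]
        interior[:, 1:] &= m[:, :-1]
        interior[:, :-1] &= m[:, 1:]
        # The contour is the masked boundary: masked but not interior.
        edge = m & ~interior
        rgb[z][edge] = color
    return rgb
```

In practice the overlaid RGB slices, rather than the raw grayscale ones, would be fed to the VLM, so the focus cue travels through the standard image pathway with no architectural change.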
On RadGenome-ChestCT and PMC-VQA, Qwen variants (0.5B/1.5B/3B) with these prompts perform on par with a prompt-free Qwen-7B while cutting GPU memory use. Moreover, prompt-guided fine-tuning further lifts closed-ended accuracy and improves open-ended VQA as measured by BLEU-4, ROUGE-L, and METEOR.