Burgess James, Nirschl Jeffrey J, Bravo-Sánchez Laura, Lozano Alejandro, Gupte Sanket Rajan, Galaz-Montoya Jesus G, Zhang Yuhui, Su Yuchang, Bhowmik Disha, Coman Zachary, Hasan Sarina M, Johannesson Alexandra, Leineweber William D, Nair Malvika G, Yarlagadda Ridhi, Zuraski Connor, Chiu Wah, Cohen Sarah, Hansen Jan N, Leonetti Manuel D, Liu Chad, Lundberg Emma, Yeung-Levy Serena
Stanford University.
Tsinghua University.
ArXiv. 2025 Mar 17:arXiv:2503.13399v1.
Scientific research demands sophisticated reasoning over multimodal data, a challenge especially prevalent in biology. Despite recent advances in multimodal large language models (MLLMs) for AI-assisted research, existing multimodal reasoning benchmarks only target up to college-level difficulty, while research-level benchmarks emphasize lower-level perception, falling short of the complex multimodal reasoning needed for scientific discovery. To bridge this gap, we introduce MicroVQA, a visual-question answering (VQA) benchmark designed to assess three reasoning capabilities vital in research workflows: expert image understanding, hypothesis generation, and experiment proposal. MicroVQA consists of 1,042 multiple-choice questions (MCQs) curated by biology experts across diverse microscopy modalities, ensuring that VQA samples represent real scientific practice. In constructing the benchmark, we find that standard MCQ generation methods induce language shortcuts, motivating a new two-stage pipeline: an optimized LLM prompt structures question-answer pairs into MCQs; then, an agent-based 'RefineBot' updates them to remove shortcuts. Benchmarking state-of-the-art MLLMs reveals a peak performance of 53%; models with smaller LLMs only slightly underperform top models, suggesting that language-based reasoning is less challenging than multimodal reasoning; and tuning with scientific articles enhances performance. Expert analysis of chain-of-thought responses shows that perception errors are the most frequent, followed by knowledge errors and then overgeneralization errors. These insights highlight the challenges of multimodal scientific reasoning and show that MicroVQA is a valuable resource for advancing AI-driven biomedical research. The MicroVQA benchmark and project page are available online.
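The two-stage MCQ pipeline described above can be sketched roughly as follows. This is an illustrative outline only, not the authors' implementation: the function names (`structure_mcq`, `has_language_shortcut`, `refine_bot`), the injected `llm` callable, and the prompt wording are all assumptions, and parsing of model output into options is omitted.

```python
# Minimal sketch of the two-stage MCQ pipeline, assuming an externally supplied
# text-completion callable `llm`. Names and prompts are illustrative only.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class MCQ:
    question: str
    options: List[str]   # candidate answers, including the correct one
    answer_index: int    # index of the correct option
    revisions: int = 0   # number of RefineBot passes applied

def structure_mcq(llm: Callable[[str], str], question: str, answer: str) -> MCQ:
    """Stage 1: prompt an LLM to recast a raw expert question-answer pair
    into a multiple-choice question with plausible distractors."""
    prompt = (
        "Rewrite the following question and answer as a 4-option multiple-choice "
        "question. Keep the correct answer and invent 3 plausible distractors.\n"
        f"Question: {question}\nAnswer: {answer}"
    )
    raw = llm(prompt)
    # Parsing `raw` into concrete options is format-specific and omitted here.
    return MCQ(question=question, options=[answer, "...", "...", "..."], answer_index=0)

def has_language_shortcut(llm: Callable[[str], str], mcq: MCQ) -> bool:
    """Check whether a text-only model answers correctly WITHOUT the image,
    i.e. whether the MCQ leaks the answer through language alone."""
    prompt = mcq.question + "\n" + "\n".join(
        f"{chr(65 + i)}. {opt}" for i, opt in enumerate(mcq.options)
    )
    # Assumes the model replies with a leading option letter (A, B, C, D).
    guess = llm(prompt).strip().upper()[:1]
    return guess == chr(65 + mcq.answer_index)

def refine_bot(llm: Callable[[str], str], mcq: MCQ, max_rounds: int = 3) -> MCQ:
    """Stage 2: iteratively rewrite distractors until the text-only shortcut
    disappears or the round budget is exhausted."""
    for _ in range(max_rounds):
        if not has_language_shortcut(llm, mcq):
            break
        rewrite = llm(
            "The following multiple-choice question can be answered without the "
            "image. Rewrite the distractors so visual evidence is required:\n"
            + mcq.question + "\n" + "\n".join(mcq.options)
        )
        # In a real pipeline the rewritten options would replace mcq.options;
        # that parsing step is omitted in this sketch.
        mcq.revisions += 1
    return mcq
```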