
MicroVQA: A Multimodal Reasoning Benchmark for Microscopy-Based Scientific Research

Authors

Burgess James, Nirschl Jeffrey J, Bravo-Sánchez Laura, Lozano Alejandro, Gupte Sanket Rajan, Galaz-Montoya Jesus G, Zhang Yuhui, Su Yuchang, Bhowmik Disha, Coman Zachary, Hasan Sarina M, Johannesson Alexandra, Leineweber William D, Nair Malvika G, Yarlagadda Ridhi, Zuraski Connor, Chiu Wah, Cohen Sarah, Hansen Jan N, Leonetti Manuel D, Liu Chad, Lundberg Emma, Yeung-Levy Serena

Affiliations

Stanford University.

Tsinghua University.

Publication

ArXiv. 2025 Mar 17:arXiv:2503.13399v1.

PMID: 40166749
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11957224/
Abstract

Scientific research demands sophisticated reasoning over multimodal data, a challenge especially prevalent in biology. Despite recent advances in multimodal large language models (MLLMs) for AI-assisted research, existing multimodal reasoning benchmarks only target up to college-level difficulty, while research-level benchmarks emphasize lower-level perception, falling short of the complex multimodal reasoning needed for scientific discovery. To bridge this gap, we introduce MicroVQA, a visual-question answering (VQA) benchmark designed to assess three reasoning capabilities vital in research workflows: expert image understanding, hypothesis generation, and experiment proposal. MicroVQA consists of 1,042 multiple-choice questions (MCQs) curated by biology experts across diverse microscopy modalities, ensuring VQA samples represent real scientific practice. In constructing the benchmark, we find that standard MCQ generation methods induce language shortcuts, motivating a new two-stage pipeline: an optimized LLM prompt structures question-answer pairs into MCQs; then, an agent-based 'RefineBot' updates them to remove shortcuts. Benchmarking state-of-the-art MLLMs reveals a peak performance of 53%; models with smaller LLMs only slightly underperform top models, suggesting that language-based reasoning is less challenging than multimodal reasoning; and tuning with scientific articles enhances performance. Expert analysis of chain-of-thought responses shows that perception errors are the most frequent, followed by knowledge errors and then overgeneralization errors. These insights highlight the challenges in multimodal scientific reasoning, showing MicroVQA is a valuable resource advancing AI-driven biomedical research. MicroVQA is available here, project here.
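The abstract's two-stage MCQ pipeline can be illustrated with a minimal sketch. Everything below is hypothetical scaffolding, not the authors' code: in the real pipeline both the MCQ structuring and the "RefineBot" revisions are performed by LLMs, and the shortcut check is an LLM answering the question without the image. Here those steps are stubbed with simple heuristics so the control flow is runnable.

```python
# Hypothetical sketch of a two-stage MCQ pipeline like the one the paper
# describes (function names and heuristics are illustrative assumptions).
from dataclasses import dataclass


@dataclass
class MCQ:
    question: str
    options: list        # options[answer_idx] is the correct answer
    answer_idx: int
    revisions: int = 0


def stage1_structure(question: str, answer: str, distractors: list) -> MCQ:
    """Stage 1: pack an expert question-answer pair plus distractors into
    multiple-choice form (in the paper, done by an optimized LLM prompt)."""
    return MCQ(question=question, options=[answer] + distractors, answer_idx=0)


def has_language_shortcut(mcq: MCQ) -> bool:
    """Stand-in shortcut detector. The paper checks whether a text-only LLM
    can answer the MCQ; this toy version flags MCQs whose distractors are
    all much shorter than the correct answer, a classic giveaway."""
    correct = mcq.options[mcq.answer_idx]
    return all(len(opt) < len(correct) / 2
               for i, opt in enumerate(mcq.options) if i != mcq.answer_idx)


def refine_bot(mcq: MCQ, max_rounds: int = 3) -> MCQ:
    """Stage 2: iteratively rewrite distractors until no shortcut remains.
    The real RefineBot is an LLM agent; here we just pad distractors so
    option length no longer reveals the answer."""
    for _ in range(max_rounds):
        if not has_language_shortcut(mcq):
            break
        mcq.options = [opt if i == mcq.answer_idx else opt + " (in this image)"
                       for i, opt in enumerate(mcq.options)]
        mcq.revisions += 1
    return mcq
```

A usage round-trip: structure a QA pair whose distractors are trivially short, then refine until the heuristic no longer fires, leaving the correct answer untouched.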


Figures
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/616f/11957224/4ba7cb12901c/nihpp-2503.13399v1-f0001.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/616f/11957224/af89e8728bcf/nihpp-2503.13399v1-f0002.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/616f/11957224/2cf7a0f532a4/nihpp-2503.13399v1-f0003.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/616f/11957224/742146cc3457/nihpp-2503.13399v1-f0004.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/616f/11957224/9fae3f882f1e/nihpp-2503.13399v1-f0005.jpg

Similar Articles

1. MicroVQA: A Multimodal Reasoning Benchmark for Microscopy-Based Scientific Research.
   ArXiv. 2025 Mar 17:arXiv:2503.13399v1.
2. Interpretable medical image Visual Question Answering via multi-modal relationship graph learning.
   Med Image Anal. 2024 Oct;97:103279. doi: 10.1016/j.media.2024.103279. Epub 2024 Jul 20.
3. CARDBiomedBench: A Benchmark for Evaluating Large Language Model Performance in Biomedical Research: A novel question-and-answer benchmark designed to assess Large Language Models' comprehension of biomedical research, piloted on Neurodegenerative Diseases.
   bioRxiv. 2025 Jan 21:2025.01.15.633272. doi: 10.1101/2025.01.15.633272.
4. Prophet: Prompting Large Language Models With Complementary Answer Heuristics for Knowledge-Based Visual Question Answering.
   IEEE Trans Pattern Anal Mach Intell. 2025 Aug;47(8):6797-6808. doi: 10.1109/TPAMI.2025.3562422.
5. A survey on multimodal large language models.
   Natl Sci Rev. 2024 Nov 12;11(12):nwae403. doi: 10.1093/nsr/nwae403. eCollection 2024 Dec.
6. A Comparative Analysis of the Performance of Large Language Models and Human Respondents in Dermatology.
   Indian Dermatol Online J. 2025 Feb 27;16(2):241-247. doi: 10.4103/idoj.idoj_221_24. eCollection 2025 Mar-Apr.
7. Arch-Eval benchmark for assessing chinese architectural domain knowledge in large language models.
   Sci Rep. 2025 Apr 18;15(1):13485. doi: 10.1038/s41598-025-98236-0.
8. MMAgentRec, a personalized multi-modal recommendation agent with large language model.
   Sci Rep. 2025 Apr 8;15(1):12062. doi: 10.1038/s41598-025-96458-w.
9. Knowledge-Routed Visual Question Reasoning: Challenges for Deep Representation Embedding.
   IEEE Trans Neural Netw Learn Syst. 2022 Jul;33(7):2758-2767. doi: 10.1109/TNNLS.2020.3045034. Epub 2022 Jul 6.
10. Assessing the performance of zero-shot visual question answering in multimodal large language models for 12-lead ECG image interpretation.
   Front Cardiovasc Med. 2025 Feb 6;12:1458289. doi: 10.3389/fcvm.2025.1458289. eCollection 2025.

References Cited in This Article

1. Biomedical Visual Instruction Tuning with Clinician Preference Alignment.
   Adv Neural Inf Process Syst. 2024 Dec;37:96449-96467.
2. Global organelle profiling reveals subcellular localization and remodeling at proteome scale.
   Cell. 2025 Feb 20;188(4):1137-1155.e20. doi: 10.1016/j.cell.2024.11.028. Epub 2024 Dec 31.
3. Empowering biomedical discovery with AI agents.
   Cell. 2024 Oct 31;187(22):6125-6151. doi: 10.1016/j.cell.2024.09.022.
4. A generalist vision-language foundation model for diverse biomedical tasks.
   Nat Med. 2024 Nov;30(11):3129-3141. doi: 10.1038/s41591-024-03185-2. Epub 2024 Aug 7.
5. A multimodal generative AI copilot for human pathology.
   Nature. 2024 Oct;634(8033):466-473. doi: 10.1038/s41586-024-07618-3. Epub 2024 Jun 12.
6. Omega - harnessing the power of large language models for bioimage analysis.
   Nat Methods. 2024 Aug;21(8):1371-1373. doi: 10.1038/s41592-024-02310-w.
7. ChatMOF: an artificial intelligence system for predicting and generating metal-organic frameworks using large language models.
   Nat Commun. 2024 Jun 3;15(1):4705. doi: 10.1038/s41467-024-48998-4.
8. Augmenting large language models with chemistry tools.
   Nat Mach Intell. 2024;6(5):525-535. doi: 10.1038/s42256-024-00832-8. Epub 2024 May 8.
9. Orientation-invariant autoencoders learn robust representations for shape profiling of cells and organelles.
   Nat Commun. 2024 Feb 3;15(1):1022. doi: 10.1038/s41467-024-45362-4.
10. Autonomous chemical research with large language models.
   Nature. 2023 Dec;624(7992):570-578. doi: 10.1038/s41586-023-06792-0. Epub 2023 Dec 20.