A framework for biomedical figure segmentation towards image-based document retrieval.

Author information

Lopez Luis D, Yu Jingyi, Arighi Cecilia, Tudor Catalina O, Torii Manabu, Huang Hongzhan, Vijay-Shanker K, Wu Cathy

Publication information

BMC Syst Biol. 2013;7 Suppl 4(Suppl 4):S8. doi: 10.1186/1752-0509-7-S4-S8. Epub 2013 Oct 23.

DOI: 10.1186/1752-0509-7-S4-S8

PMID: 24565394

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC3856606/

Abstract

Figures in biomedical publications play an important role in understanding the biological experiments and facts described in them. Recent studies have shown that information extracted from figures can be integrated into classical document classification and retrieval tasks to improve their accuracy. One important observation about figures in biomedical publications is that they are often composed of multiple subfigures or panels, each describing a different methodology or result. The use of these multimodal figures is common practice in bioscience, as experimental results are graphically validated via multiple methodologies or procedures. Thus, to make better use of multimodal figures in document classification or retrieval tasks, and to provide the evidence source for derived assertions, it is important to automatically segment multimodal figures into subfigures and panels. This is a challenging task, however, as different panels can contain similar objects (e.g., bar charts and line charts) arranged in multiple layouts. In addition, certain types of biomedical figures are text-heavy (e.g., images of DNA sequences and protein sequences) and differ from traditional images. As a result, classical image segmentation techniques based on low-level image features, such as edges or color, are not directly applicable to robustly partitioning multimodal figures into single-modal panels. In this paper, we describe a robust solution for automatically identifying and segmenting unimodal panels from a multimodal figure. Our framework starts by robustly harvesting figure-caption pairs from biomedical articles. We base our approach on the observation that the document layout can be used to identify encoded figures and figure boundaries within PDF files. Taking the document layout into consideration allows us to correctly extract figures from the PDF document and associate them with their corresponding captions.
We combine pixel-level representations of the extracted images with information gathered from their corresponding captions to estimate the number of panels in the figure. Thus, our approach simultaneously identifies the number of panels and the layout of figures. To evaluate the approach, we applied our system to documents containing protein-protein interactions (PPIs) and compared the results against a gold standard annotated by biologists. Experimental results showed that our automatic figure segmentation approach surpasses purely caption-based and purely image-based approaches, achieving 96.64% accuracy. To allow for efficient retrieval of information, and to provide the basis for integration into document classification and retrieval systems, among others, we further developed a web-based interface that lets users easily retrieve panels containing the terms specified in their queries.
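The idea of cross-checking caption evidence against pixel evidence can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the algorithm, so everything here is an assumption made for illustration. The function names (`caption_panel_labels`, `background_gaps`, `panel_count_estimate`) are hypothetical, panel labels are assumed to appear in the caption as single parenthesized letters such as "(A)", the image is assumed to be a grayscale grid with a pure white background, and only a single horizontal row of panels is handled.

```python
import re


def caption_panel_labels(caption):
    """Return the distinct single-letter panel labels, e.g. "(A)", in a caption.

    Assumption: labels are written as one parenthesized letter. Ranges such
    as "(A-C)" and other labeling styles are not handled in this sketch.
    """
    labels = re.findall(r"\(([A-Za-z])\)", caption)
    return sorted({label.upper() for label in labels})


def background_gaps(image, background=255, min_gap=2):
    """Return (start, end) column ranges that contain only background pixels.

    `image` is a list of rows of grayscale values. Blank runs touching the
    left or right border are treated as margins and ignored.
    """
    width = len(image[0])
    blank = [all(row[x] == background for row in image) for x in range(width)]
    gaps, start = [], None
    for x, is_blank in enumerate(blank):
        if is_blank and start is None:
            start = x                      # a blank run begins
        elif not is_blank and start is not None:
            if x - start >= min_gap:       # keep only sufficiently wide runs
                gaps.append((start, x))
            start = None
    # A run still open at the end touches the right border and is dropped;
    # a run starting at column 0 touches the left border and is dropped here.
    return [(s, e) for s, e in gaps if s > 0]


def panel_count_estimate(caption, image):
    """Compare the caption-based and pixel-based panel counts."""
    from_caption = len(caption_panel_labels(caption))
    from_pixels = len(background_gaps(image)) + 1
    return {
        "caption": from_caption,
        "pixels": from_pixels,
        "agree": from_caption == from_pixels,
    }


if __name__ == "__main__":
    # Toy 4x12 image: two dark blocks separated by a 3-column white gap.
    toy = [[0] * 4 + [255] * 3 + [0] * 4 + [255] for _ in range(4)]
    print(panel_count_estimate("(A) Blot. (B) Chart.", toy))
```

When the two counts disagree, a real system would need a strategy for reconciling them; the abstract reports only that combining both sources outperformed either one alone.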

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0703/3856606/8dbf47e34595/1752-0509-7-S4-S8-1.jpg
