A framework for biomedical figure segmentation towards image-based document retrieval.

Author information

Lopez Luis D, Yu Jingyi, Arighi Cecilia, Tudor Catalina O, Torii Manabu, Huang Hongzhan, Vijay-Shanker K, Wu Cathy

Publication information

BMC Syst Biol. 2013;7 Suppl 4(Suppl 4):S8. doi: 10.1186/1752-0509-7-S4-S8. Epub 2013 Oct 23.

Abstract

Figures in biomedical publications play an important role in understanding the biological experiments and facts they describe. Recent studies have shown that information extracted from figures can be integrated into classical document classification and retrieval tasks to improve their accuracy. An important observation is that these figures are often composed of multiple subfigures or panels, each describing a different methodology or result. Such multimodal figures are common in bioscience, as experimental results are graphically validated via multiple methodologies or procedures. Thus, to make better use of multimodal figures in document classification or retrieval tasks, and to provide the evidence source for derived assertions, it is important to automatically segment multimodal figures into subfigures and panels. This is a challenging task, however, as different panels can contain similar objects (e.g., bar charts and line charts) arranged in multiple layouts. In addition, certain types of biomedical figures are text-heavy (e.g., images of DNA and protein sequences) and differ from traditional images. As a result, classical image segmentation techniques based on low-level image features, such as edges or color, are not directly applicable to robustly partitioning multimodal figures into single-modal panels. In this paper, we describe a robust solution for automatically identifying and segmenting unimodal panels from a multimodal figure. Our framework starts by robustly harvesting figure-caption pairs from biomedical articles. We base our approach on the observation that the document layout can be used to identify encoded figures and figure boundaries within PDF files. Taking the document layout into consideration allows us to correctly extract figures from the PDF document and associate them with their corresponding captions.
We combine pixel-level representations of the extracted images with information gathered from their corresponding captions to estimate the number of panels in the figure. Thus, our approach simultaneously identifies the number of panels and the layout of figures. To evaluate the approach, we applied our system to documents containing protein-protein interactions (PPIs) and compared the results against a gold standard annotated by biologists. Experimental results showed that our automatic figure segmentation approach surpasses purely caption-based and purely image-based approaches, achieving 96.64% accuracy. To allow for efficient retrieval of information, and to provide a basis for integration into document classification and retrieval systems, among others, we further developed a web-based interface that lets users easily retrieve panels containing the terms specified in their queries.
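
The abstract does not give implementation details, so the following is only a minimal sketch of how caption information and pixel-level information might be combined. It is not the authors' method. It assumes panels are labeled in the caption as single letters in parentheses, such as (A) and (B), that panels sit side by side on a uniform background, and that the image is a grayscale grid given as a list of rows. All function names (`caption_panel_labels`, `content_runs`, `segment_with_caption`) are hypothetical.

```python
import re
from typing import List, Tuple


def caption_panel_labels(caption: str) -> List[str]:
    """Distinct single-letter labels such as (A) or (b), in order of first appearance."""
    labels: List[str] = []
    for match in re.findall(r"\(([A-Za-z])\)", caption):
        label = match.upper()
        if label not in labels:
            labels.append(label)
    return labels


def content_runs(profile: List[int], min_gap: int) -> List[Tuple[int, int]]:
    """Half-open column ranges that contain ink.

    Runs separated by fewer than min_gap blank columns are merged.
    """
    runs: List[List[int]] = []
    start = None
    for i, count in enumerate(profile):
        if count > 0 and start is None:
            start = i
        elif count == 0 and start is not None:
            runs.append([start, i])
            start = None
    if start is not None:
        runs.append([start, len(profile)])

    merged: List[List[int]] = []
    for run in runs:
        if merged and run[0] - merged[-1][1] < min_gap:
            merged[-1][1] = run[1]  # gap too narrow: extend the previous run
        else:
            merged.append(run)
    return [(a, b) for a, b in merged]


def segment_with_caption(
    image: List[List[int]], caption: str, background: int = 255
) -> List[Tuple[int, int]]:
    """Split an image into column ranges, using the caption's panel count as a prior.

    Assumption of this sketch: the caption count selects the smallest gap
    threshold whose image-based split yields the same number of panels.
    """
    expected = len(caption_panel_labels(caption))
    width = len(image[0])
    # Vertical projection profile: non-background pixels per column.
    profile = [sum(1 for row in image if row[x] != background) for x in range(width)]
    for min_gap in range(1, width + 1):
        runs = content_runs(profile, min_gap)
        if len(runs) == expected:
            return runs
    return content_runs(profile, 1)  # fall back to the image-only split
```

For a caption such as "(A) Blot. (B) Chart.", the sketch keeps widening the gap threshold until the image splits into exactly two column ranges, so a narrow blank band inside a panel is not mistaken for a panel boundary. Captions that use ranges such as "(A-C)" and figures with grid layouts are outside what this sketch handles.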

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0703/3856606/8dbf47e34595/1752-0509-7-S4-S8-1.jpg