The Computational Biomedicine and Machine Learning Lab, Department of Computer & Information Sciences, University of Delaware, 18 Amstel Ave, Newark, DE 19716, USA.
The Jackson Laboratory, 600 Main Street, Bar Harbor, ME, USA.
Database (Oxford). 2020 Jan 1;2020. doi: 10.1093/database/baaa024.
Gathering information from the scientific literature is essential for biomedical research, as much knowledge is conveyed through publications. However, the large and rapidly increasing publication rate makes it impractical for researchers to quickly identify all and only those documents related to their interest. As such, automated biomedical document classification attracts much interest. Such classification is critical in the curation of biological databases, because biocurators must scan through a vast number of articles to identify pertinent information within documents most relevant to the database. This is a slow, labor-intensive process that can benefit from effective automation. We present a document classification scheme aiming to identify papers containing information relevant to a specific topic, among a large collection of articles, for supporting the biocuration classification task. Our framework is based on a meta-classification scheme we have introduced before; here we incorporate into it features gathered from figure captions, in addition to those obtained from titles and abstracts. We trained and tested our classifier over a large imbalanced dataset, originally curated by the Gene Expression Database (GXD). GXD collects all the gene expression information in the Mouse Genome Informatics (MGI) resource. As part of the MGI literature classification pipeline, GXD curators identify MGI-selected papers that are relevant for GXD. The dataset consists of ~60 000 documents (5469 labeled as relevant; 52 866 as irrelevant), gathered throughout 2012-2016, in which each document is represented by the text of its title, abstract and figure captions. Our classifier attains precision 0.698, recall 0.784, f-measure 0.738 and Matthews correlation coefficient 0.711, demonstrating that the proposed framework effectively addresses the high imbalance in the GXD classification task. 
Moreover, our classifier's performance is significantly improved by utilizing information from image captions compared to using titles and abstracts alone; this observation clearly demonstrates that image captions provide substantial information for supporting biomedical document classification and curation. Database URL.
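Gathering information from the scientific literature is essential for biomedical research, because much knowledge is conveyed through publications. However, because the volume of publications is large and growing rapidly, researchers find it hard to quickly identify all the documents related to their interests. Automated biomedical document classification has therefore attracted considerable attention. Such classification is critical in the curation of biological databases, because biocurators must browse a large number of articles to identify the specific information in documents that is most relevant to the database. This is a slow, labor-intensive process whose efficiency can be improved through effective automation. We propose a document classification scheme that aims to identify papers related to a specific topic within a large collection of articles, in support of the biocurators' classification task. Our framework is based on a meta-classification scheme we introduced previously; here, we combine features obtained from figure captions with those obtained from titles and abstracts. We trained and tested the classifier on a large imbalanced dataset originally curated by the Gene Expression Database (GXD). GXD collects all the gene expression information in the Mouse Genome Informatics (MGI) resource. As part of the MGI literature classification pipeline, GXD curators identify the MGI-selected papers that are relevant to GXD. The dataset consists of roughly 60 000 documents (5469 labeled as relevant; 52 866 as irrelevant), collected between 2012 and 2016, in which each document is represented by the text of its title, abstract and figure captions. Our classifier attains precision 0.698, recall 0.784, F-measure 0.738 and Matthews correlation coefficient 0.711, showing that the proposed framework effectively addresses the high imbalance in the GXD classification task. Moreover, using the information in figure captions significantly improves the classifier's performance compared with using titles and abstracts alone; this observation clearly shows that figure captions provide important information for supporting biomedical document classification and curation. Database URL.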
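For readers less familiar with the four reported metrics, the following is a minimal Python sketch of how they relate. It is an illustration only, not the authors' evaluation code. The function names (`precision_recall`, `f_measure`, `mcc`) and the toy confusion counts are assumptions introduced here; the abstract does not report the confusion matrix. The only values taken from the abstract are the precision (0.698) and recall (0.784), whose harmonic mean is about 0.7385, in line with the reported F-measure of 0.738 once rounding of the inputs is taken into account.

```python
import math


def precision_recall(tp: int, fp: int, fn: int) -> tuple[float, float]:
    """Precision and recall for the positive ("relevant") class."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def f_measure(precision: float, recall: float) -> float:
    """F-measure (F1): the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def mcc(tp: int, fp: int, fn: int, tn: int) -> float:
    """Matthews correlation coefficient computed from confusion counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


# Precision and recall as reported in the abstract.
reported_f = f_measure(0.698, 0.784)
print(f"F-measure from reported precision/recall: {reported_f:.4f}")

# Hypothetical counts (NOT from the paper): 100 relevant vs. 900 irrelevant
# documents, mimicking a roughly 1:9 class imbalance.
tp, fn, fp, tn = 80, 20, 30, 870
p, r = precision_recall(tp, fp, fn)
accuracy = (tp + tn) / (tp + fp + fn + tn)
print(f"toy precision={p:.3f} recall={r:.3f} F={f_measure(p, r):.3f}")
print(f"toy accuracy={accuracy:.3f} MCC={mcc(tp, fp, fn, tn):.3f}")
```

In the toy example, accuracy is high mainly because the irrelevant class dominates, while the F-measure and the Matthews correlation coefficient are noticeably lower. This is one common reason for reporting those two measures on imbalanced data such as the GXD dataset.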