Effective biomedical document classification for identifying publications relevant to the mouse Gene Expression Database (GXD).

Author information

Jiang Xiangying, Ringwald Martin, Blake Judith, Shatkay Hagit

Affiliations

Department of Computer and Information Sciences, University of Delaware, 101 Smith Hall, Newark, DE, USA.

Department of Computer and Information Sciences, The Jackson Laboratory, 600 Main Street, Bar Harbor, ME, USA.

Publication information

Database (Oxford). 2017 Jan 1;2017(1). doi: 10.1093/database/bax017.

Abstract

The Gene Expression Database (GXD) is a comprehensive online database within the Mouse Genome Informatics resource, aiming to provide available information about endogenous gene expression during mouse development. The information stems primarily from many thousands of biomedical publications that database curators must go through and read. Given the very large number of biomedical papers published each year, automatic document classification plays an important role in biomedical research. Specifically, an effective and efficient document classifier is needed for supporting the GXD annotation workflow. We present here an effective yet relatively simple classification scheme, which uses readily available tools while employing feature selection, aiming to assist curators in identifying publications relevant to GXD. We examine the performance of our method over a large manually curated dataset, consisting of more than 25 000 PubMed abstracts, of which about half are curated as relevant to GXD while the other half as irrelevant to GXD. In addition to text from title-and-abstract, we also consider image captions, an important information source that we integrate into our method. We apply a captions-based classifier to a subset of about 3300 documents, for which the full text of the curated articles is available. The results demonstrate that our proposed approach is robust and effectively addresses the GXD document classification. Moreover, using information obtained from image captions clearly improves performance, compared to title and abstract alone, affirming the utility of image captions as a substantial evidence source for automatically determining the relevance of biomedical publications to a specific subject area.
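The general shape of such a pipeline (tokenize, select discriminative terms, train a classifier, and fold caption text in with the title and abstract) can be sketched as follows. The abstract does not name the classifier, the feature selection criterion, or how captions are combined, so everything here is an assumption for illustration only: a multinomial Naive Bayes model, a document-frequency-difference selection score, simple concatenation of caption text, a tiny hand-made stop-word list, and four invented toy documents. It is not the authors' implementation.

```python
import math
import re
from collections import Counter

# Hypothetical stop-word list; a real system would use a fuller one.
STOPWORDS = {"a", "of", "in", "the", "at", "for", "by", "during"}

def tokenize(text):
    """Lowercase, split on non-alphanumeric characters, drop stop words."""
    tokens = re.split(r"[^a-z0-9]+", text.lower())
    return [t for t in tokens if t and t not in STOPWORDS]

def select_features(docs, labels, k):
    """Keep the k terms whose document frequency differs most between classes
    (an assumed criterion, not necessarily the one used in the paper)."""
    pos_n = sum(1 for y in labels if y == 1)
    neg_n = len(labels) - pos_n
    pos_df, neg_df = Counter(), Counter()
    for text, y in zip(docs, labels):
        (pos_df if y == 1 else neg_df).update(set(tokenize(text)))
    vocab = set(pos_df) | set(neg_df)

    def score(term):
        return abs(pos_df[term] / pos_n - neg_df[term] / neg_n)

    # Ties are broken alphabetically so the selection is deterministic.
    return set(sorted(vocab, key=lambda t: (-score(t), t))[:k])

def train_nb(docs, labels, features):
    """Multinomial Naive Bayes with add-one smoothing over selected terms."""
    counts = {0: Counter(), 1: Counter()}
    class_n = Counter(labels)
    for text, y in zip(docs, labels):
        counts[y].update(t for t in tokenize(text) if t in features)
    model = {}
    for y in (0, 1):
        total = sum(counts[y].values()) + len(features)
        model[y] = {
            "prior": math.log(class_n[y] / len(labels)),
            "loglik": {t: math.log((counts[y][t] + 1) / total) for t in features},
        }
    return model

def predict(model, text):
    """Return the class (1 = relevant, 0 = irrelevant) with the higher log score."""
    scores = {}
    for y, params in model.items():
        loglik = params["loglik"]
        scores[y] = params["prior"] + sum(loglik[t] for t in tokenize(text) if t in loglik)
    return max(scores, key=scores.get)

# Invented toy records: (title-and-abstract text, caption text, label).
train = [
    ("Expression of Pax6 in the developing mouse eye",
     "In situ hybridization of embryo sections at E12.5", 1),
    ("Sox2 expression during mouse embryo development",
     "Immunohistochemistry showing expression in neural tube", 1),
    ("A survey of hospital readmission rates",
     "Bar chart of readmission rates by year", 0),
    ("Clinical trial of a new statin in adults",
     "Kaplan-Meier survival curves for patients", 0),
]
# Assumed integration strategy: concatenate caption text with title-and-abstract.
docs = [f"{title_abstract} {caption}" for title_abstract, caption, _ in train]
labels = [y for _, _, y in train]

features = select_features(docs, labels, k=10)
model = train_nb(docs, labels, features)
```

Calling `predict(model, "Wnt1 expression in mouse embryo sections")` on this toy model classifies the text as relevant, because `expression`, `mouse` and `embryo` are among the selected terms and occur only in the positive examples. With four training documents the result says nothing about real-world performance; the sketch only shows how feature selection restricts the vocabulary that the classifier sees.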

DATABASE URL

www.informatics.jax.org.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f1cd/5467553/3a0ed6abd0f6/bax017f1.jpg
