Exploiting MEDLINE for gene molecular function prediction via NMF based multi-label classification.

Author information

Yale Center for Medical Informatics, Yale University, 300 George St, Suite 501, New Haven, CT 06511, United States.

University of Massachusetts Amherst, United States.

Publication information

J Biomed Inform. 2018 Oct;86:160-166. doi: 10.1016/j.jbi.2018.08.009. Epub 2018 Aug 18.

Abstract

The Gene Ontology (GO) provides a representation of terms and categories used to describe genes and their molecular functions, cellular components, and biological processes. GO has been the standard for describing the functions of specific genes in different model organisms. GO annotation, the tagging of genes with GO terms, has mostly been a manual and time-consuming curation process. Although many automated approaches to annotation have been proposed, few have used the knowledge available in the literature. In this manuscript, we describe the development and evaluation of an innovative predictive system that automatically assigns molecular functions (GO terms) to genes using the biomedical literature. Because a gene can be associated with multiple molecular functions, we posed GO molecular function annotation as a multi-label classification problem with several classes. We used non-negative matrix factorization (NMF) for feature reduction and then classified the genes. To address the multi-label aspect of the data, we used the binary-relevance method. Although we experimented with several classifiers, the combination of binary relevance and a K-nearest neighbor (KNN) classifier performed best. Our evaluation on the UniProtKB/Swiss-Prot dataset showed a best performance of 0.84 in terms of F1-measure.
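The abstract names the pipeline stages (NMF feature reduction, then binary relevance with KNN) but gives no implementation details. The following is a minimal pure-Python sketch of that general idea, not the authors' system. Everything in it is an assumption made for illustration: the toy gene-by-term matrix `V`, the label matrix `Y`, the helper names, the Lee-Seung multiplicative updates used for NMF, the rank of 2, `k=3`, and the majority-vote threshold. For simplicity it also factors all genes together, including the held-out one.

```python
import math
import random


def transpose(A):
    return [list(col) for col in zip(*A)]


def matmul(A, B):
    Bt = transpose(B)
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]


def frobenius_error(V, W, H):
    """Frobenius norm of V - WH."""
    WH = matmul(W, H)
    return math.sqrt(sum((v - r) ** 2
                         for rv, rr in zip(V, WH) for v, r in zip(rv, rr)))


def nmf(V, rank, iters=200, seed=0):
    """Factor non-negative V (n x m) into W (n x rank) and H (rank x m).

    Uses Lee-Seung multiplicative updates (an assumed choice; the abstract
    does not say which NMF algorithm was used).
    """
    rng = random.Random(seed)
    n, m = len(V), len(V[0])
    W = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n)]
    H = [[rng.random() + 0.1 for _ in range(m)] for _ in range(rank)]
    eps = 1e-9  # guards against division by zero
    for _ in range(iters):
        # H <- H * (W^T V) / (W^T W H)
        WtV = matmul(transpose(W), V)
        WtWH = matmul(matmul(transpose(W), W), H)
        H = [[H[r][j] * WtV[r][j] / (WtWH[r][j] + eps) for j in range(m)]
             for r in range(rank)]
        # W <- W * (V H^T) / (W H H^T)
        VHt = matmul(V, transpose(H))
        WHHt = matmul(W, matmul(H, transpose(H)))
        W = [[W[i][r] * VHt[i][r] / (WHHt[i][r] + eps) for r in range(rank)]
             for i in range(n)]
    return W, H


def knn_binary_relevance(train_X, train_Y, x, k=3):
    """Binary relevance with KNN: each label is decided independently.

    A label is assigned when more than half of the k nearest training
    genes (Euclidean distance) carry it.
    """
    nearest = sorted(range(len(train_X)),
                     key=lambda i: math.dist(train_X[i], x))[:k]
    n_labels = len(train_Y[0])
    return [int(2 * sum(train_Y[i][lab] for i in nearest) > k)
            for lab in range(n_labels)]


# Hypothetical toy data: rows are genes, columns are term counts
# taken from literature associated with each gene.
V = [
    [3, 2, 1, 0, 0, 0],
    [2, 3, 1, 0, 0, 0],
    [3, 3, 2, 0, 0, 0],
    [0, 0, 0, 2, 3, 1],
    [0, 0, 0, 3, 2, 2],
    [0, 0, 1, 3, 3, 1],
]
# Hypothetical labels: one column per GO molecular-function term.
Y = [[1, 0], [1, 0], [1, 1], [0, 1], [0, 1], [0, 1]]

W, H = nmf(V, rank=2)  # W holds the reduced gene features
# Hold out the last gene and predict its labels from the others.
predicted = knn_binary_relevance(W[:-1], Y[:-1], W[-1], k=3)
print(predicted)
```

In practice a library implementation would be the more natural choice; the hand-written version here only keeps the sketch dependency-free and makes each step visible.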
