From Protein Sequence to Protein Function via Multi-Label Linear Discriminant Analysis.

Author information

Wang Hua, Yan Lin, Huang Heng, Ding Chris

Publication information

IEEE/ACM Trans Comput Biol Bioinform. 2017 May-Jun;14(3):503-513. doi: 10.1109/TCBB.2016.2591529. Epub 2016 Jul 14.

Abstract

Sequence describes the primary structure of a protein, which contains important structural, characteristic, and genetic information and thereby motivates many sequence-based computational approaches to infer protein function. Among them, feature-based approaches attract increased attention because they make predictions from a set of transformed and more biologically meaningful sequence features. However, the original features extracted from a sequence are usually high-dimensional and often compromised by irrelevant patterns; dimension reduction is therefore necessary prior to classification for efficient and effective protein function prediction. A protein usually performs several different functions within an organism, which makes protein function prediction a multi-label classification problem. In machine learning, multi-label classification deals with problems where each object may belong to more than one class. As a well-known feature reduction method, linear discriminant analysis (LDA) has been successfully applied in many practical applications. By nature, however, it is designed for single-label classification, in which each object belongs to exactly one class. Because directly applying LDA in multi-label classification causes ambiguity when computing scatter matrices, we apply a new Multi-label Linear Discriminant Analysis (MLDA) approach to address this problem while preserving the powerful classification capability inherited from classical LDA. We further extend MLDA by ℓ1-normalization to overcome the problem of over-counting data points with multiple labels. In addition, we incorporate biological network data into our method using Laplacian embedding, and assess the reliability of predicted putative functions. Extensive empirical evaluations demonstrate promising results of our methods.
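The abstract gives no formulas, so the following is only a minimal sketch of the general idea under stated assumptions, not the authors' formulation. It assumes that each protein contributes to the scatter matrices of every class it belongs to, and that each row of the binary label matrix is ℓ1-normalized so a protein with m labels contributes weight 1/m per class instead of being counted m times. The function names (`l1_normalize_labels`, `mlda_scatter_matrices`, `mlda_projection`) and the `reg` parameter are hypothetical, and the network-based Laplacian embedding step is not covered.

```python
import numpy as np

def l1_normalize_labels(Y):
    """Divide each row of a binary label matrix by its number of labels.

    Assumption for this sketch: a point with m labels gets weight 1/m in
    each of its classes; unlabeled rows stay all-zero.
    """
    Y = np.asarray(Y, dtype=float)
    row_sums = Y.sum(axis=1, keepdims=True)
    return np.divide(Y, row_sums, out=np.zeros_like(Y), where=row_sums > 0)

def mlda_scatter_matrices(X, Y):
    """Class-wise weighted within-class (Sw) and between-class (Sb) scatter."""
    X = np.asarray(X, dtype=float)
    W = l1_normalize_labels(Y)            # (n, k) per-class weights
    class_weight = W.sum(axis=0)          # effective size of each class
    # Weighted global mean; equals the mean over all labeled points.
    global_mean = (W.sum(axis=1) @ X) / class_weight.sum()
    d = X.shape[1]
    Sw = np.zeros((d, d))
    Sb = np.zeros((d, d))
    for k in range(W.shape[1]):
        if class_weight[k] == 0:
            continue                      # skip classes with no members
        w = W[:, k]
        mean_k = (w @ X) / class_weight[k]
        diff = X - mean_k
        Sw += (diff * w[:, None]).T @ diff
        gap = mean_k - global_mean
        Sb += class_weight[k] * np.outer(gap, gap)
    return Sw, Sb

def mlda_projection(X, Y, n_components, reg=1e-6):
    """Directions maximizing between-class over within-class scatter.

    Solves Sb v = lambda (Sw + reg * I) v by Cholesky whitening; `reg` is a
    small ridge term (an assumption) that keeps Sw invertible.
    """
    Sw, Sb = mlda_scatter_matrices(X, Y)
    d = Sw.shape[0]
    L = np.linalg.cholesky(Sw + reg * np.eye(d))
    M = np.linalg.solve(L, np.linalg.solve(L, Sb).T)   # L^-1 Sb L^-T
    M = (M + M.T) / 2                                   # remove rounding asymmetry
    _, vecs = np.linalg.eigh(M)                         # eigenvalues ascending
    top = vecs[:, ::-1][:, :n_components]               # largest first
    return np.linalg.solve(L.T, top)                    # map back: L^-T v
```

With a feature matrix `X` of shape (n, d) and a binary label matrix `Y` of shape (n, k), `X @ mlda_projection(X, Y, n_components)` gives the reduced features that a downstream multi-label classifier would consume.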
