Wan Shibiao, Mak Man-Wai, Kung Sun-Yuan
Department of Electronic and Information Engineering, The Hong Kong Polytechnic University, Hong Kong SAR, China.
Department of Electrical Engineering, Princeton University, NJ, USA.
J Theor Biol. 2014 Nov 7;360:34-45. doi: 10.1016/j.jtbi.2014.06.031. Epub 2014 Jul 2.
Locating proteins within cellular contexts is of paramount significance in elucidating their biological functions. Computational methods based on knowledge databases, such as the gene ontology annotation (GOA) database, are known to be more efficient than sequence-based methods. However, knowledge-based methods typically face three problems: (1) knowledge databases are enormous and growing exponentially, (2) knowledge databases contain redundant information, and (3) the number of features extracted from knowledge databases is much larger than the number of data samples with ground-truth labels. These properties make the extracted features liable to contain redundant or irrelevant information, causing the prediction systems to suffer from overfitting. To address these problems, this paper proposes an efficient multi-label predictor, namely R3P-Loc, which uses two compact databases for feature extraction and applies random projection (RP) to reduce the feature dimensions of an ensemble ridge regression (RR) classifier. Two new compact databases are created from the Swiss-Prot and GOA databases. These databases possess almost the same amount of information as their full-size counterparts but are much smaller. Experimental results on two recent datasets (eukaryote and plant) suggest that R3P-Loc can reduce the feature dimensions seven-fold and significantly outperform state-of-the-art predictors. This paper also demonstrates that the compact databases reduce memory consumption by a factor of 39 without degrading prediction accuracy. For readers' convenience, the R3P-Loc server is available online at http://bioinfo.eie.polyu.edu.hk/R3PLocServer/.
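The combination of random projection and an ensemble of ridge regression classifiers named in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The function names, the Gaussian projection matrix, the penalty `rho`, the +1/-1 label coding, the decision rule (accept every label whose averaged score exceeds a threshold, plus the top-scoring label) and the synthetic data are all assumptions made for illustration. The 700-to-100 reduction is chosen only to mirror the seven-fold figure reported in the abstract.

```python
import numpy as np


def fit_ridge(X, Y, rho=1.0):
    """Closed-form ridge regression: W = (X^T X + rho*I)^-1 X^T Y."""
    dim = X.shape[1]
    return np.linalg.solve(X.T @ X + rho * np.eye(dim), X.T @ Y)


def fit_rp_ridge_ensemble(X, Y, proj_dim, n_members, rho=1.0, seed=0):
    """Train one ridge model per random projection (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    members = []
    for _ in range(n_members):
        # Gaussian random projection from X.shape[1] down to proj_dim (assumed form).
        R = rng.standard_normal((X.shape[1], proj_dim)) / np.sqrt(proj_dim)
        members.append((R, fit_ridge(X @ R, Y, rho)))
    return members


def predict_scores(members, X):
    """Average the per-class scores over all ensemble members."""
    return np.mean([X @ R @ W for R, W in members], axis=0)


def predict_labels(scores, threshold=0.0):
    """Assumed multi-label rule: every class above threshold, plus the best class."""
    labels = scores > threshold
    labels[np.arange(len(scores)), scores.argmax(axis=1)] = True
    return labels


# Synthetic stand-in data: 60 samples, 700 features, 4 candidate locations.
rng = np.random.default_rng(42)
X = rng.standard_normal((60, 700))
Y = np.where(rng.random((60, 4)) < 0.3, 1.0, -1.0)  # +1 = located there, -1 = not

members = fit_rp_ridge_ensemble(X, Y, proj_dim=100, n_members=5)
labels = predict_labels(predict_scores(members, X))
print(labels.shape)  # (60, 4)
```

Each ensemble member sees a different low-dimensional view of the same features, so averaging their scores reduces the dependence on any single random projection. Forcing the top-scoring class to be selected guarantees that every sample receives at least one label.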