Comparative Bioinformatics, Bioinformatics and Genomics, Centre for Genomic Regulation (CRG), Barcelona, Spain; Universitat Pompeu Fabra (UPF), Barcelona, Spain.
PLoS One. 2013 Oct 11;8(10):e75542. doi: 10.1371/journal.pone.0075542. eCollection 2013.
Predicting protein functional classes such as localization sites and modifications plays a crucial role in function annotation. Given the tremendous amount of sequence data produced by high-throughput sequencing experiments, the need for efficient and interpretable prediction strategies has grown rapidly. Our previous approach for subcellular localization prediction, PSLDoc, achieves high overall accuracy for Gram-negative bacteria. However, PSLDoc is computationally intensive because it incorporates homology extension in feature extraction and probabilistic latent semantic analysis in feature reduction. In addition, prediction results generated by support vector machines are accurate but generally difficult to interpret. In this work, we incorporate three new techniques to improve efficiency and interpretability. First, homology extension is performed against a compact non-redundant database using a fast search model to reduce running time. Second, correspondence analysis (CA) is incorporated as an efficient feature reduction method that generates a clear visual separation of different protein classes. Finally, functional classes are predicted by a combination of the accurate compact set (CS) relation and the interpretable one-nearest-neighbor (1-NN) algorithm. Besides localization data sets, we also apply the method to a human protein kinase set to validate its generality. Experimental results demonstrate that our method makes accurate predictions in a more efficient and interpretable manner. First, homology extension using a fast search on a compact database runs up to twenty-five times faster than the traditional approach without sacrificing prediction performance. This suggests that the computational costs of many other predictors that also incorporate homology information can be largely reduced. In addition, CA not only efficiently identifies discriminative features but also provides a clear visualization of different functional classes. Moreover, predictions based on CS achieve 100% precision. When combined with 1-NN for targets left unpredicted by CS, our method attains slightly better or comparable performance relative to state-of-the-art systems.
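The feature reduction and nearest-neighbor steps described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a small toy matrix of non-negative feature weights (rows are proteins, columns are features), uses the textbook SVD formulation of correspondence analysis, and omits both homology extension and the compact set (CS) step. The function names `fit_ca`, `project_rows`, and `predict_1nn`, the toy data, and the labels are all hypothetical.

```python
import numpy as np

def fit_ca(counts, k):
    """Textbook correspondence analysis via SVD (illustrative, not the paper's code).

    Returns the principal coordinates of the training rows and the standard
    coordinates of the columns, both restricted to the first k dimensions.
    """
    counts = np.asarray(counts, dtype=float)
    P = counts / counts.sum()           # correspondence matrix
    r = P.sum(axis=1)                   # row masses
    c = P.sum(axis=0)                   # column masses
    # Standardized residuals: (P - r c^T) scaled by 1/sqrt(r_i * c_j)
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    U, sing, Vt = np.linalg.svd(S, full_matrices=False)
    col_std = Vt.T[:, :k] / np.sqrt(c)[:, None]
    row_coords = (U[:, :k] * sing[:k]) / np.sqrt(r)[:, None]
    return row_coords, col_std

def project_rows(counts, col_std):
    """Project rows (training or new) into CA space as supplementary points."""
    counts = np.asarray(counts, dtype=float)
    profiles = counts / counts.sum(axis=1, keepdims=True)
    return profiles @ col_std

def predict_1nn(train_coords, train_labels, query_coords):
    """Assign each query the label of its nearest training point (Euclidean)."""
    diffs = query_coords[:, None, :] - train_coords[None, :, :]
    dists = np.linalg.norm(diffs, axis=2)
    return [train_labels[i] for i in dists.argmin(axis=1)]

# Hypothetical toy data: six proteins, three features, two classes.
X_train = np.array([[8, 1, 1], [7, 2, 1], [9, 1, 2],
                    [1, 8, 2], [2, 7, 1], [1, 9, 1]], dtype=float)
y_train = ["A", "A", "A", "B", "B", "B"]
X_query = np.array([[6, 1, 1], [1, 6, 2]], dtype=float)

train_coords, col_std = fit_ca(X_train, k=2)
query_coords = project_rows(X_query, col_std)
predictions = predict_1nn(train_coords, y_train, query_coords)
print(predictions)
```

Because each prediction is traced to a single nearest training protein in a two-dimensional space, both the neighbor and the plot position can be inspected directly, which is the kind of interpretability the abstract attributes to CA combined with 1-NN.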