IEEE/ACM Trans Comput Biol Bioinform. 2020 Sep-Oct;17(5):1648-1659. doi: 10.1109/TCBB.2019.2911609. Epub 2019 Apr 16.
The order of amino acids in a protein sequence enables the protein to acquire a conformation suitable for performing its functions, which motivates analyzing these sequences to predict function. Although machine learning based approaches are fast compared to methods using BLAST, FASTA, etc., they fail to perform well for long protein sequences (those with more than 300 amino acids). In this paper, we introduce a novel method for constructing two separate feature sets for proteins using a bi-directional long short-term memory network, based on the analysis of fixed 1) single-sized segments and 2) multi-sized segments. The model trained on the proposed multi-sized segment feature set is combined with the model trained using state-of-the-art Multi-label Linear Discriminant Analysis (MLDA) features to further improve the accuracy. Extensive evaluations using separate datasets for biological processes and molecular functions demonstrate not only improved results for long sequences, but also a significant improvement in overall accuracy over the state-of-the-art method. The single-sized approach produces an improvement of +3.37 percent for biological processes and +5.48 percent for molecular functions over the MLDA based classifier. The corresponding numbers for the multi-sized approach are +5.38 and +8.00 percent. Combining the two models raises the improvement further to +7.41 and +9.21 percent, respectively.
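The segmentation step that the two feature sets rely on can be illustrated with a minimal Python sketch. The abstract does not state the segment sizes, whether segments overlap, or how a short final segment is handled, so everything here is an assumption for illustration only: the function names `split_fixed` and `split_multi` are hypothetical, segments are non-overlapping, and the last segment is padded with `X`. The bi-directional LSTM that would consume these segments is not shown.

```python
from typing import Dict, Iterable, List


def split_fixed(seq: str, size: int, pad: str = "X") -> List[str]:
    """Split a sequence into non-overlapping segments of one fixed size.

    Assumption: the final segment is right-padded with `pad` so that all
    segments have equal length. The source does not specify this.
    """
    if size <= 0:
        raise ValueError("size must be positive")
    segments = [seq[i:i + size] for i in range(0, len(seq), size)]
    if segments and len(segments[-1]) < size:
        segments[-1] = segments[-1].ljust(size, pad)
    return segments


def split_multi(seq: str, sizes: Iterable[int], pad: str = "X") -> Dict[int, List[str]]:
    """Apply the fixed-size split once per segment size (multi-sized view)."""
    return {size: split_fixed(seq, size, pad) for size in sizes}


if __name__ == "__main__":
    toy = "ACDEFGHIKL"  # toy 10-residue sequence, not real data
    print(split_fixed(toy, 4))
    print(split_multi(toy, (2, 5)))
```

In this sketch the single-sized view is one call to `split_fixed`, while the multi-sized view keeps one list of segments per size, each of which could be encoded separately before the resulting features are combined.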