Singh Jaspreet, Litfin Thomas, Paliwal Kuldip, Singh Jaswinder, Hanumanthappa Anil Kumar, Zhou Yaoqi
Signal Processing Laboratory, School of Engineering and Built Environment, Griffith University, Brisbane, QLD 4111, Australia.
School of Information and Communication Technology, Griffith University, Southport, QLD 4222, Australia.
Bioinformatics. 2021 Oct 25;37(20):3464-3472. doi: 10.1093/bioinformatics/btab316.
Knowledge of protein secondary structure and other one-dimensional structural properties is essential for accurate prediction of protein structure and function. As a result, many methods have been developed for predicting these one-dimensional structural properties. However, most methods rely on evolutionary information that may not exist for many proteins because they lack sequence homologs. Moreover, obtaining evolutionary information is computationally intensive as the library of protein sequences continues to expand exponentially. Here, we developed a new single-sequence method called SPOT-1D-Single, based on a large training dataset of 39 120 proteins deposited prior to 2016 and an ensemble of hybrid long short-term memory bidirectional neural networks and convolutional neural networks.
We showed that SPOT-1D-Single consistently improves over SPIDER3-Single and ProteinUnet in the prediction of secondary structure, solvent accessibility, contact number and backbone angles on all seven independent test sets (TEST2018, SPOT-2016, SPOT-2016-HQ, SPOT-2018, SPOT-2018-HQ, CASP12 and CASP13 free-modeling targets). For example, the accuracy of predicted three-state secondary structure ranges from 72.12% to 74.28% for SPOT-1D-Single, compared with 69.1-72.6% for SPIDER3-Single and 70.6-73% for ProteinUnet. On SPOT-2018 proteins with no homologs (Neff = 1), SPOT-1D-Single also predicts three-state (SS3) and eight-state (SS8) secondary structure with 6.24% and 6.98% better accuracy than SPOT-1D, respectively. The new method's improvement over existing techniques is due to a larger training set combined with ensemble learning.
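The SS3 and SS8 accuracies quoted above are percentages of correctly predicted residues. As a minimal sketch of how such a figure can be computed, the Python below reduces eight-state labels to three states and scores a prediction against a reference. It rests on assumptions not stated in the abstract: the 8-to-3 mapping is the commonly used DSSP reduction (H, G, I to H; E, B to E; everything else to C), and the function names `reduce_ss8` and `per_residue_accuracy` are hypothetical and are not part of the SPOT-1D-Single code base.

```python
# Commonly used DSSP 8-state to 3-state reduction.
# ASSUMPTION: the abstract does not state which mapping the authors used.
SS8_TO_SS3 = {
    "H": "H", "G": "H", "I": "H",   # helices
    "E": "E", "B": "E",             # strands and bridges
    "T": "C", "S": "C", "C": "C",   # turns, bends and coil
}


def reduce_ss8(ss8: str) -> str:
    """Map an eight-state label string to its three-state equivalent."""
    return "".join(SS8_TO_SS3[state] for state in ss8)


def per_residue_accuracy(predicted: str, reference: str) -> float:
    """Return the percentage of residues whose predicted label matches the reference."""
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    if not reference:
        raise ValueError("cannot score an empty sequence")
    matches = sum(p == r for p, r in zip(predicted, reference))
    return 100.0 * matches / len(reference)


if __name__ == "__main__":
    reference_ss8 = "CHHHHGGTEEEEBSC"   # illustrative labels, not real data
    predicted_ss8 = "CHHHHHHTEEEECSC"
    print(per_residue_accuracy(predicted_ss8, reference_ss8))
    print(per_residue_accuracy(reduce_ss8(predicted_ss8), reduce_ss8(reference_ss8)))
```

Because several eight-state labels collapse into one three-state label, a prediction can score higher on SS3 than on SS8 for the same protein, which is why the two accuracies are reported separately.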
A standalone version of SPOT-1D-Single is available at https://github.com/jas-preet/SPOT-1D-Single. Predictions can also be made directly at https://sparks-lab.org/server/spot-1d-single. The datasets used in this research can also be downloaded from GitHub.
Supplementary data are available at Bioinformatics online.