Institute of Bioinformatics and Systems Biology, National Chiao Tung University, Hsinchu, Taiwan.
Department of Biological Science and Technology, National Chiao Tung University, Hsinchu, Taiwan.
PLoS One. 2020 Jun 30;15(6):e0235153. doi: 10.1371/journal.pone.0235153. eCollection 2020.
Secondary structure prediction of proteins is a classic topic in computational structural biology with a variety of applications. Over the past decade, state-of-the-art algorithms have achieved prediction accuracies above 80%; meanwhile, the time cost of prediction has increased rapidly because of the exponential growth of the fundamental protein sequence data. Based on literature studies and preliminary observations of the relationships between the size and homology of the fundamental protein dataset and the speed and accuracy of predictions, we raised two hypotheses that might help determine the main factors influencing the efficiency of secondary structure prediction. Experiments on size and homology reduction of the fundamental protein dataset supported those hypotheses. They revealed that shrinking the dataset could substantially cut the time cost of prediction with a slight decrease in accuracy, whereas homology reduction of the dataset could increase accuracy. Moreover, Shannon information entropy could be applied to explain how accuracy was influenced by the size and homology of the dataset. Based on these findings, we proposed that a proper combination of size and homology reductions of the protein dataset could speed up secondary structure prediction while preserving the high accuracy of state-of-the-art algorithms. When the proposed strategy was tested with the 2018 fundamental protein dataset provided by the Universal Protein Resource, prediction speed increased more than 20-fold while all accuracy measures remained equivalently high. These findings should help improve the efficiency of research and applications that depend on protein secondary structure prediction. To make future implementations of the proposed strategy easy, we have established a database of size- and homology-reduced protein datasets at http://10.life.nctu.edu.tw/UniRefNR.
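The abstract invokes Shannon information entropy but does not say what the entropy is computed over. As a minimal illustration only, the sketch below assumes it is measured per column of a set of aligned homologous sequences, using the standard definition H = -Σ p·log2(p). The function names and the toy alignments are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def shannon_entropy(symbols):
    """Shannon entropy, in bits, of the empirical distribution of `symbols`."""
    counts = Counter(symbols)
    total = sum(counts.values())
    if total == 0:
        return 0.0  # an empty column carries no information
    # p * log2(1/p) is the same as -p * log2(p) and avoids a negative zero.
    return sum((n / total) * math.log2(total / n) for n in counts.values())


def column_entropies(aligned_seqs, gap="-"):
    """Entropy of each column of equal-length aligned sequences, ignoring gaps."""
    return [
        shannon_entropy([residue for residue in column if residue != gap])
        for column in zip(*aligned_seqs)
    ]


# Hypothetical toy alignments (assumption: not data from the paper).
redundant = ["ACDE", "ACDE", "ACDE", "ACDF"]  # near-identical homologs
diverse = ["ACDE", "ACEF", "GCDH", "ICKF"]    # more varied homologs

if __name__ == "__main__":
    print("redundant:", column_entropies(redundant))
    print("diverse:  ", column_entropies(diverse))
```

In this toy case the near-identical sequences give zero entropy in all but one column, while the more varied set gives higher entropy in every column that is not fully conserved. This only illustrates the measure itself; how the authors relate entropy to dataset size, homology, and accuracy is described in the paper, not here.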