Institute of Bioinformatics and Systems Biology, National Chiao Tung University, Hsinchu, Taiwan.
Institute of Bioinformatics and Systems Biology, National Yang Ming Chiao Tung University, Hsinchu, Taiwan.
PLoS One. 2021 Jul 14;16(7):e0254555. doi: 10.1371/journal.pone.0254555. eCollection 2021.
The secondary structure prediction (SSP) of proteins has long been an essential structural biology technique with various applications. Despite its vital role in many research and industrial fields, SSP has in recent years come to be regarded as either no longer challenging or too challenging to advance, because the accuracy of state-of-the-art secondary structure predictors is approaching the theoretical upper limit. Believing that substantial improvement of SSP would advance the many fields that depend on it, we conducted this study, which focuses on three issues that have not yet been noticed or thoroughly examined but may have affected the reliability of evaluations of previous SSP algorithms. All three issues concern sequence homology between or within the development and evaluation datasets. We therefore designed many different homology layouts of datasets to train and evaluate SSP models. Each experiment was repeated multiple times by random sampling. The conclusions obtained with small experimental datasets were verified with large-scale datasets using state-of-the-art SSP algorithms. Contrary to the long-established assumption, we found that sequence homology between the query datasets used for training, testing, and independent tests exerts little influence on SSP accuracy. In addition, sequence homology redundancy between or within most datasets causes the accuracy of an SSP algorithm to be overestimated, whereas redundancy within the reference dataset used for extracting predictive features causes the accuracy to be underestimated. Since the overestimating effects are more significant than the underestimating effect, the accuracy of some SSP methods might have been overestimated. Based on these discoveries, we propose a rigorous procedure for developing SSP algorithms and making reliable evaluations, hoping to bring substantial improvements to future SSP methods and benefit all research and application fields relying on accurate prediction of protein secondary structures.
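To make the notion of homology redundancy "between or within" datasets concrete, the following is a minimal illustrative sketch, not the procedure used in the study. It assumes a crude string-similarity score from Python's `difflib` as a stand-in for alignment-based sequence identity, and the names `approx_identity`, `remove_redundancy`, `threshold`, and `reference` are hypothetical. A real pipeline would compute identity from sequence alignments.

```python
from difflib import SequenceMatcher


def approx_identity(a: str, b: str) -> float:
    """Crude similarity in [0, 1]; an assumed stand-in for alignment-based identity."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def remove_redundancy(seqs, threshold, reference=()):
    """Greedily keep sequences that are not too similar to anything already kept.

    Passing only `seqs` reduces redundancy WITHIN one dataset.
    Passing `reference` also removes sequences similar to another dataset,
    i.e. reduces redundancy BETWEEN datasets.
    """
    kept = []
    for s in seqs:
        others = list(reference) + kept
        if all(approx_identity(s, t) < threshold for t in others):
            kept.append(s)
    return kept


# Toy sequences (hypothetical, for illustration only).
train = ["ACDEFGHIKL", "ACDEFGHIKM", "WWWWYYYYPP"]
test = ["ACDEFGHIKM", "NNNNQQQQSS"]

train_nr = remove_redundancy(train, threshold=0.5)
test_nr = remove_redundancy(test, threshold=0.5, reference=train_nr)
```

The threshold value here is arbitrary. The sketch only shows where within-dataset and between-dataset filtering would sit in a dataset-preparation step.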