Department of Applied Informatics, Silesian University of Technology, Akademicka 16, 44-100, Gliwice, Poland.
Department of Bioinformatics and Telemedicine, Jagiellonian University Medical College, Medyczna 7, 30-688, Kraków, Poland.
BMC Bioinformatics. 2022 Mar 22;23(1):100. doi: 10.1186/s12859-022-04623-z.
The prediction of protein secondary structure (SS) is a crucial step in ab initio tertiary structure prediction, which in turn provides information about protein activity and function. Because experimental methods are expensive and sometimes infeasible, many SS predictors, mainly based on machine learning methods, have been proposed over the years. Currently, most of the top methods use evolutionary-based input features derived from PSSM and HHblits. Recently, however, embeddings have appeared: a new description of protein sequences, generated by language models (LMs), that could be leveraged as input features. Apart from the cost of calculating input features, the top models usually need extensive computational resources for training and prediction and can barely be run on a regular PC. Because SS prediction is an imbalanced classification problem, it should not be judged by the commonly used Q3/Q8 metrics. Moreover, as the benchmark datasets are not random samples, classical statistical null hypothesis testing based on the Neyman-Pearson approach is not appropriate.
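To illustrate why an accuracy-style score such as Q3 can mislead on imbalanced classes, the following minimal sketch compares it with macro-averaged recall on a toy three-state sequence. The sequence, the predictor output, and the choice of macro-averaged recall as the imbalance-aware alternative are assumptions made for illustration only; they are not taken from the paper.

```python
def q_accuracy(true, pred):
    """Fraction of residues whose predicted state matches the true state (Q3/Q8 style)."""
    return sum(t == p for t, p in zip(true, pred)) / len(true)


def macro_recall(true, pred):
    """Mean of per-class recalls, so every class counts equally regardless of its size."""
    recalls = []
    for c in sorted(set(true)):
        idx = [i for i, t in enumerate(true) if t == c]
        recalls.append(sum(pred[i] == c for i in idx) / len(idx))
    return sum(recalls) / len(recalls)


# Hypothetical toy data: 8 coil (C), 10 helix (H), and 2 strand (E) residues.
true_ss = "C" * 8 + "H" * 10 + "E" * 2
# Hypothetical predictor that never predicts the rare strand class.
pred_ss = "C" * 8 + "H" * 10 + "C" * 2

q3 = q_accuracy(true_ss, pred_ss)
mr = macro_recall(true_ss, pred_ss)
print(f"Q3 = {q3:.3f}, macro recall = {mr:.3f}")
```

In this toy case the accuracy stays high even though the rare class is never recovered, while the macro-averaged recall drops noticeably.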
We present ProteinUnet2, a lightweight deep network for SS prediction that is based on the U-Net convolutional architecture and on evolutionary-based input features (from PSSM and HHblits) as well as SPOT-Contact features. Through an extensive evaluation study, we report the performance of ProteinUnet2 in comparison with top SS prediction methods based on evolutionary information (SAINT and SPOT-1D). We also propose a new statistical methodology for assessing prediction performance, which combines statistical significance from Fisher-Pitman permutation tests with practical significance measured by Cohen's effect size.
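The pairing of a permutation test with an effect size can be sketched as follows. This is a minimal illustration, assuming paired per-protein scores for two predictors, a Monte Carlo sign-flipping variant of the permutation test on the mean paired difference, and the paired-samples form of Cohen's d. The function names and the score values are hypothetical, and the exact test variant used by the authors may differ.

```python
import random
import statistics


def cohens_d_paired(a, b):
    """Cohen's d for paired samples: mean difference divided by the SD of the differences."""
    diffs = [x - y for x, y in zip(a, b)]
    return statistics.mean(diffs) / statistics.stdev(diffs)


def paired_permutation_test(a, b, n_resamples=10000, seed=0):
    """Two-sided Monte Carlo permutation test on the mean paired difference.

    Under the null hypothesis the two labels within each pair are exchangeable,
    so the sign of every paired difference is flipped with probability 0.5.
    """
    rng = random.Random(seed)
    diffs = [x - y for x, y in zip(a, b)]
    observed = abs(statistics.mean(diffs))
    hits = 0
    for _ in range(n_resamples):
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(statistics.mean(flipped)) >= observed:
            hits += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return (hits + 1) / (n_resamples + 1)


# Hypothetical per-protein scores for two predictors (made-up numbers).
model_a = [0.81, 0.79, 0.85, 0.77, 0.90, 0.83, 0.76, 0.88, 0.82, 0.80]
model_b = [0.78, 0.77, 0.84, 0.73, 0.88, 0.80, 0.75, 0.84, 0.80, 0.77]

p_value = paired_permutation_test(model_a, model_b)
effect = cohens_d_paired(model_a, model_b)
print(f"p = {p_value:.4f}, Cohen's d = {effect:.2f}")
```

Reporting both numbers separates the question of whether a difference is unlikely under exchangeability (the p-value) from the question of whether it is large enough to matter in practice (the effect size).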
Our results suggest that the ProteinUnet2 architecture has much shorter training and inference times while maintaining results similar to those of the SAINT and SPOT-1D predictors. Given the relatively long time needed to calculate evolutionary-based features (from PSSM in particular), it would be worthwhile to test the predictive ability of embeddings as input features in the future. We strongly believe that the statistical methodology proposed here for evaluating SS prediction results will be adopted, used, and even expanded by the research community.