Barger Jacob, Adhikari Badri
IEEE/ACM Trans Comput Biol Bioinform. 2022 Nov-Dec;19(6):3586-3594. doi: 10.1109/TCBB.2021.3115053. Epub 2022 Dec 8.
Much of the recent success in protein structure prediction has resulted from accurate protein contact prediction, a binary classification problem. Dozens of methods, built from various types of machine learning and deep learning algorithms, have been published over the last two decades for predicting contacts. Recently, many groups, including Google DeepMind, have demonstrated that reformulating the problem as a multi-class classification problem is a more promising direction to pursue. As an alternative approach, we recently proposed real-valued distance predictions, formulating the problem as a regression problem. The nuances of protein 3D structures make this formulation appropriate, allowing predictions to reflect inter-residue distances in nature. Despite this promise, the accurate prediction of real-valued distances remains relatively unexplored, possibly because classification is better suited to machine learning and deep learning algorithms.
Can regression methods be designed to predict real-valued distances as precisely as binary contacts? To investigate this, we propose multiple novel methods of input label engineering, which differs from feature engineering in that it aims to optimize the distribution of the target distances to suit the loss function of the deep learning model. Since an important use of predicted contacts or distances is to build three-dimensional models, we also tested whether predicted distances can reconstruct more accurate models than contacts.
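To make the idea of label engineering concrete, the sketch below transforms distance labels before training and inverts the transformation on the model's output. The abstract does not specify the transformation, so the reciprocal-squared form, the function names, and the `scale` and `min_d` parameters are all hypothetical choices for illustration.

```python
def transform_distance(d, scale=100.0, min_d=1.0):
    """Map a distance in angstroms to a training label.

    Hypothetical transform (not taken from the paper): scale / d^2.
    Short distances get large labels, so errors on them weigh more
    heavily in a squared-error loss than errors on long distances.
    """
    d = max(d, min_d)  # clamp to avoid division by very small values
    return scale / (d * d)


def inverse_transform(label, scale=100.0):
    """Recover a distance in angstroms from a predicted label."""
    return (scale / label) ** 0.5
```

Under this assumed transform, a 4 Å pair receives a much larger label than a 20 Å pair, which shifts the regression loss toward the short-distance range where contacts are defined.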
Our results demonstrate, for the first time, that deep learning methods for real-valued protein distance prediction can deliver distances as precise as binary classification methods. When using an optimal distance transformation function on the standard PSICOV dataset of 150 representative proteins, the precision of 'top-all' long-range contacts improves from 60.9% to 61.4% when predicting real-valued distances instead of contacts. When building three-dimensional models, we observed an average TM-score increase from 0.61 to 0.72, highlighting the advantage of predicting real-valued distances.
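The following is a minimal sketch of how contact precision might be computed from a predicted distance map, so that regression output can be compared with a contact classifier. It assumes the common conventions of an 8 Å contact threshold and a sequence separation of at least 24 residues for long-range pairs, and it assumes that 'top-all' means selecting as many predictions as there are native long-range contacts. None of these details is stated in the abstract, and the function name and parameters are hypothetical.

```python
def long_range_precision(pred, true, min_sep=24, threshold=8.0, top_k=None):
    """Precision of the top-ranked long-range contacts.

    pred, true: square nested lists of distances in angstroms.
    min_sep:    minimum sequence separation for a long-range pair (assumed).
    threshold:  distance below which a pair counts as a contact (assumed).
    top_k:      number of predictions to score; defaults to the number of
                native long-range contacts (assumed meaning of 'top-all').
    """
    n = len(true)
    pairs = [(i, j) for i in range(n) for j in range(i + min_sep, n)]
    native = {(i, j) for (i, j) in pairs if true[i][j] < threshold}
    if top_k is None:
        top_k = len(native)
    # Rank pairs by predicted distance: shortest means most confident contact.
    ranked = sorted(pairs, key=lambda p: pred[p[0]][p[1]])
    selected = ranked[:top_k]
    if not selected:
        return 0.0
    hits = sum(1 for p in selected if p in native)
    return hits / len(selected)
```

Ranking by predicted distance is what lets a regression model be scored with the same metric as a classifier: the shortest predicted distances play the role of the highest contact probabilities.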