Investigating the ability of deep learning-based structure prediction to extrapolate and/or enrich the set of antibody CDR canonical forms.
Affiliation
Oxford Protein Informatics Group, Department of Statistics, University of Oxford, Oxford, United Kingdom.
Publication information
Front Immunol. 2024 Feb 28;15:1352703. doi: 10.3389/fimmu.2024.1352703. eCollection 2024.
Deep learning models have been shown to accurately predict protein structure from sequence, allowing researchers to explore protein space from the structural viewpoint. In this paper we explore whether "novel" features, such as distinct loop conformations, can arise from these predictions despite not being present in the training data. We used ABodyBuilder2, a deep learning antibody structure predictor, to predict the structures of ~1.5M paired antibody sequences. We examined the predicted structures of the canonical CDR loops and found that most of these predictions fall into the already described CDR canonical form structural space. We also found a small number of "new" canonical clusters composed of heterogeneous sequences united by a common sequence motif and loop conformation. Analysis of these novel clusters showed their origins to be either shapes seen in the training data at very low frequency or shapes seen at high frequency but at a shorter sequence length. To evaluate explicitly the ability of ABodyBuilder2 to extrapolate, we retrained several models whilst withholding all antibody structures of a specific CDR loop length or canonical form. These "starved" models showed evidence of generalisation across CDRs of different lengths, but they did not extrapolate to loop conformations which were highly distinct from those present in the training data. However, the models were able to accurately predict a canonical form even if only a very small number of examples of that shape were in the training data. Our results suggest that deep learning protein structure prediction methods are unable to make completely out-of-domain predictions for CDR loops. However, our analysis also showed that even minimal amounts of data for a structural shape allow the method to recover its original predictive abilities. We have made the ~1.5M predicted structures used in this study available to download at https://doi.org/10.5281/zenodo.10280181.
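The "starved" retraining described in the abstract depends on removing every training structure whose CDR has a given loop length or canonical form. The following is a minimal sketch of such a hold-out split, not the authors' pipeline: the `AntibodyStructure` record, the `split_starved` function, and the labels `form-A` to `form-D` are hypothetical names introduced purely for illustration, and the sketch assumes each structure has already been annotated with per-CDR loop lengths and canonical-form labels.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class AntibodyStructure:
    """Hypothetical record: one antibody structure with per-CDR annotations."""
    structure_id: str
    # Maps a CDR name (e.g. "H1", "L3") to (loop length, canonical-form label).
    cdrs: Dict[str, Tuple[int, str]]


def split_starved(
    structures: List[AntibodyStructure],
    cdr: str,
    length: Optional[int] = None,
    canonical_form: Optional[str] = None,
) -> Tuple[List[AntibodyStructure], List[AntibodyStructure]]:
    """Split structures into (kept, withheld) for a "starved" training run.

    A structure is withheld if its annotation for `cdr` matches the given
    loop length or the given canonical-form label.
    """
    if length is None and canonical_form is None:
        raise ValueError("Specify a loop length or a canonical form to withhold.")

    kept: List[AntibodyStructure] = []
    withheld: List[AntibodyStructure] = []
    for structure in structures:
        annotation = structure.cdrs.get(cdr)
        matches = annotation is not None and (
            (length is not None and annotation[0] == length)
            or (canonical_form is not None and annotation[1] == canonical_form)
        )
        (withheld if matches else kept).append(structure)
    return kept, withheld


# Toy data with invented identifiers and labels (not taken from the study).
toy = [
    AntibodyStructure("a", {"H1": (13, "form-A"), "L3": (9, "form-C")}),
    AntibodyStructure("b", {"H1": (13, "form-B"), "L3": (10, "form-D")}),
    AntibodyStructure("c", {"H1": (14, "form-A")}),
]

# Withhold every structure whose H1 loop has length 13.
kept_by_length, withheld_by_length = split_starved(toy, "H1", length=13)

# Withhold every structure whose H1 loop has the (invented) label "form-A".
kept_by_form, withheld_by_form = split_starved(toy, "H1", canonical_form="form-A")

print([s.structure_id for s in kept_by_length])
print([s.structure_id for s in kept_by_form])
```

In this sketch the withheld set would serve as the out-of-domain test set, while the kept set is what a starved model would be trained on. Adding a handful of withheld structures back into the kept set would mimic the low-data setting the abstract mentions.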