Department of Computational Biology, Cornell University, Ithaca, 14853, USA.
Weill Institute for Cell and Molecular Biology, Cornell University, Ithaca, 14853, USA.
Interdiscip Sci. 2024 Dec;16(4):802-813. doi: 10.1007/s12539-024-00639-6. Epub 2024 Aug 19.
X-ray crystallography is the most widely used method for determining protein three-dimensional (3D) structures, and it requires that the target protein be crystallizable. Crystallization, however, involves several successive steps, including protein material production, purification, and crystal production, each of which affects the final outcome. Because this multi-stage process is expensive and laborious, various computational tools have been developed to predict protein crystallization propensity and thereby guide experimental structure determination. In this study, we present PLMC, a novel deep learning framework that improves multi-stage protein crystallization propensity prediction by leveraging a pre-trained protein language model. To train PLMC effectively, two groups of features for each protein are integrated into a more comprehensive representation: protein language model embeddings learned from a large-scale protein sequence database, and a handcrafted feature set comprising physicochemical, sequence-based, and disorder-related information. The two feature groups are first embedded separately for refinement and then concatenated for the final prediction. In extensive benchmarking tests, PLMC outperforms other state-of-the-art methods, achieving AUC scores of 0.773, 0.893, and 0.913 at the three individual stages listed above, respectively, and 0.982 at the final crystallization stage. PLMC is also shown to be superior for predicting the crystallization of both globular and membrane proteins, with an AUC score of 0.991 for the latter. These results suggest that PLMC has significant potential to assist researchers in the experimental design of crystallizable protein variants.
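The two-branch design described in the abstract (embed each feature group separately, then concatenate for the final prediction) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the class name `TwoBranchSketch`, the layer sizes, the embedding dimension of 1280, the 20 handcrafted features, the random weights, and the single sigmoid output are all assumptions chosen only to show the data flow.

```python
import numpy as np

rng = np.random.default_rng(0)


def relu_dense(x, w, b):
    """One fully connected layer followed by ReLU."""
    return np.maximum(x @ w + b, 0.0)


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


class TwoBranchSketch:
    """Hypothetical two-branch fusion model (illustrative, untrained)."""

    def __init__(self, plm_dim, hand_dim, hidden=32, n_outputs=1):
        # Assumed sizes; the abstract does not give layer dimensions.
        self.w_plm = rng.normal(scale=0.01, size=(plm_dim, hidden))
        self.b_plm = np.zeros(hidden)
        self.w_hand = rng.normal(scale=0.01, size=(hand_dim, hidden))
        self.b_hand = np.zeros(hidden)
        self.w_out = rng.normal(scale=0.01, size=(2 * hidden, n_outputs))
        self.b_out = np.zeros(n_outputs)

    def forward(self, plm_emb, hand_feats):
        # Branch 1: refine the protein language model embedding.
        h_plm = relu_dense(plm_emb, self.w_plm, self.b_plm)
        # Branch 2: refine the handcrafted features.
        h_hand = relu_dense(hand_feats, self.w_hand, self.b_hand)
        # Fuse the two refined representations by concatenation.
        fused = np.concatenate([h_plm, h_hand], axis=1)
        # Output a propensity score in (0, 1) per protein.
        return sigmoid(fused @ self.w_out + self.b_out)


# Usage with random stand-in inputs for 5 proteins.
model = TwoBranchSketch(plm_dim=1280, hand_dim=20)
plm_emb = rng.normal(size=(5, 1280))    # stand-in for per-protein embeddings
hand_feats = rng.normal(size=(5, 20))   # stand-in for handcrafted features
probs = model.forward(plm_emb, hand_feats)
print(probs.shape)
```

Whether the stages are predicted by separate models or by one model with several outputs is not stated in the abstract; here `n_outputs` is left as a parameter so either arrangement could be sketched.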