David R. Cheriton School of Computer Science, University of Waterloo, 200 University Ave W, Waterloo, ON N2L 3G1, Canada.
School of Computer Science and Engineering, Central South University, Changsha 410083, China.
Brief Bioinform. 2022 Sep 20;23(5). doi: 10.1093/bib/bbac352.
X-ray diffraction (XRD), a crystallography-based technique, is the main experimental method for determining the three-dimensional structure of proteins. Producing the protein crystals on which XRD relies involves multiple experimental steps and requires substantial manpower and material resources. In addition, studies have shown that not all proteins form crystals under experimental conditions, and the overall success rate of protein crystallization is below 10%. Although some protein crystallization predictors have been developed, few tools can predict multi-stage protein crystallization propensity, and the accuracy of those tools is not satisfactory. In this paper, we propose a novel deep learning framework, named SADeepcry, for predicting protein crystallization propensity. The framework estimates the propensity of success for three steps in protein crystallization experiments (protein material production, purification and crystallization) as well as the success rate of the final protein crystallization. SADeepcry uses optimized self-attention and auto-encoder modules to extract sequence, structure and physicochemical features from proteins. Compared with other state-of-the-art protein crystallization propensity prediction models, SADeepcry captures more complex global, long-range dependencies in protein sequence information. Our computational results show that SADeepcry increases the Matthews correlation coefficient and the area under the curve by 100.3% and 13.4%, respectively, over the DCFCrystal method on the benchmark dataset. The code of SADeepcry is available at https://github.com/zhc940702/SADeepcry.
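To make the reported metric concrete, the following is a minimal sketch of how the Matthews correlation coefficient is computed from a binary confusion matrix, and of how a percentage gain over a baseline could be derived. It assumes the reported gains are relative improvements over the baseline score, which the abstract does not state explicitly. The function names and the confusion-matrix counts are hypothetical illustrations and are not taken from SADeepcry or its benchmark.

```python
import math


def matthews_corrcoef(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient from binary confusion-matrix counts.

    Returns 0.0 when the denominator is zero (a common convention for the
    case where a whole row or column of the confusion matrix is empty).
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


def relative_improvement(baseline: float, new: float) -> float:
    """Percentage gain of `new` over `baseline` (assumed: relative change)."""
    return (new - baseline) / abs(baseline) * 100.0


if __name__ == "__main__":
    # Hypothetical counts for two classifiers on the same 100 proteins.
    baseline_mcc = matthews_corrcoef(tp=30, tn=35, fp=15, fn=20)
    new_mcc = matthews_corrcoef(tp=40, tn=40, fp=10, fn=10)
    gain = relative_improvement(baseline_mcc, new_mcc)
    print(f"baseline MCC = {baseline_mcc:.3f}")
    print(f"new MCC      = {new_mcc:.3f}")
    print(f"relative gain = {gain:.1f}%")
```

Under this reading, a gain of about 100% means the Matthews correlation coefficient roughly doubled relative to the baseline. It does not mean the coefficient rose by 1.0 in absolute terms.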