Zhou Bingxin, Zheng Lirong, Wu Banghao, Yi Kai, Zhong Bozitao, Tan Yang, Liu Qian, Liò Pietro, Hong Liang
Institute of Natural Sciences, Shanghai Jiao Tong University, Shanghai, China.
Shanghai National Center for Applied Mathematics (SJTU center), Shanghai Jiao Tong University, Shanghai, China.
Cell Discov. 2024 Sep 10;10(1):95. doi: 10.1038/s41421-024-00728-2.
Deep learning-based methods for generating functional proteins address the growing need for novel biocatalysts, allowing functions to be tailored to specific requirements. This advance supports the development of highly efficient and specialized proteins with applications across scientific, technological, and biomedical fields. This study establishes a pipeline for protein sequence generation based on a conditional protein diffusion model, CPDiffusion, to create diverse protein sequences with enhanced functions. CPDiffusion accommodates protein-specific conditions, such as secondary structures and highly conserved amino acids. Without relying on extensive training data, CPDiffusion effectively captures the highly conserved residues and sequence features of specific protein families. We applied CPDiffusion to generate artificial sequences of Argonaute (Ago) proteins, which are complex multi-domain programmable endonucleases, based on the backbone structures of wild-type (WT) Kurthia massiliensis Ago (KmAgo) and Pyrococcus furiosus Ago (PfAgo). The generated sequences deviate by up to nearly 400 amino acids from their WT templates. Experimental tests demonstrated that the majority of the generated proteins for both KmAgo and PfAgo show unambiguous DNA cleavage activity, with many of them exhibiting superior activity compared with the WT. These findings show that CPDiffusion achieves a high success rate in generating, in a single step, novel sequences for proteins with complex structures and functions, many of them with enhanced activity. This approach facilitates the design of enzymes with multi-domain molecular structures and intricate functions through in silico generation and screening, without the need for supervision from labeled data.
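The abstract reports that generated sequences deviate from their WT templates by up to nearly 400 amino acids. As a minimal sketch of how such a deviation could be counted, the Python below compares two sequences position by position. It assumes the generated sequence has the same length as its template (plausible for generation on a fixed backbone, but not stated in the abstract), and the function names and toy sequences are hypothetical illustrations, not the authors' code or real Ago sequences.

```python
def count_substitutions(template: str, generated: str) -> int:
    """Count positions at which two equal-length sequences differ."""
    if not template:
        raise ValueError("template sequence must not be empty")
    if len(template) != len(generated):
        # Assumption: no insertions or deletions, so no alignment is needed.
        raise ValueError("sequences must have the same length")
    return sum(a != b for a, b in zip(template, generated))


def sequence_identity(template: str, generated: str) -> float:
    """Fraction of positions at which the two sequences agree."""
    return 1.0 - count_substitutions(template, generated) / len(template)


# Toy 10-residue sequences for illustration only (not real Ago sequences).
wt = "MKTAYIAKQR"
gen = "MKSAYLAKQK"

print(count_substitutions(wt, gen))          # number of differing positions
print(round(sequence_identity(wt, gen), 2))  # fraction of identical positions
```

If the sequences differed in length, a pairwise alignment would be needed before counting, which this sketch deliberately leaves out.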