Xu Kui, Hua Guoying, Wu Mingdi, Zhang Haihang, Liu Jingda, Feng Hu, Zuo Erwei
State Key Laboratory of Genome and Multi-omics Technologies, Shenzhen Branch, Guangdong Laboratory of Lingnan Modern Agriculture, Key Laboratory of Gene Editing Technologies (Hainan), Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen, Guangdong, China.
Cell Res. 2025 Aug 18. doi: 10.1038/s41422-025-01164-x.
The vast scope of sequence databases, combined with limited supporting evidence, hinders the identification of proteins with specific functionality. Here, we experimentally characterized the catalytic efficiency, target site window, motif preference, and off-target activity of 1100 apolipoprotein B mRNA-editing enzyme, catalytic polypeptide (APOBEC)-like family cytidine deaminases (CDs) fused with nCas9 in HEK293T cells, thereby generating the largest dataset of experimentally validated functions for a single protein family to date. These data, together with amino acid sequence, three-dimensional structure, and eight additional features, were used to construct a machine learning (ML) model, AlphaCD, which showed high accuracy in predicting catalytic efficiency (0.92) and off-target activity (0.84), as well as target windows (0.73) and catalytic motifs (0.78). We applied the trained model to predict these catalytic features for 21,335 CDs in UniProt, and validation on a subsample of 28 CDs further supported its prediction accuracy (0.84, 0.87, 0.75, and 0.73, respectively). Alanine scanning-based mutagenesis was then employed to reduce off-target activity in one example CD, producing a remarkably high-fidelity, high-efficiency cytosine base editor. This demonstrates the application of AlphaCD to high-accuracy, high-throughput protein functional characterization and provides a strategy for the accelerated characterization of other proteins.
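The abstract mentions alanine scanning-based mutagenesis but gives no protocol details. As a rough illustration of the general technique only, the sketch below enumerates single-residue alanine substitutions for a protein sequence. The function name `alanine_scan`, the toy sequence, and the choice to skip positions that are already alanine are all assumptions made for illustration; none of them come from the paper, and the authors' actual variant design may differ.

```python
from typing import List, Tuple


def alanine_scan(sequence: str, start: int = 1) -> List[Tuple[str, str]]:
    """Enumerate single-residue alanine substitutions of a protein sequence.

    Returns (mutation_label, mutant_sequence) pairs, e.g. ("K2A", "MAAW").
    Positions that are already alanine are skipped (an assumption of this
    sketch; some protocols substitute glycine at those positions instead).
    `start` is the residue number assigned to the first character.
    """
    variants = []
    for i, residue in enumerate(sequence):
        if residue == "A":
            continue  # nothing to substitute at an existing alanine
        label = f"{residue}{i + start}A"
        mutant = sequence[:i] + "A" + sequence[i + 1:]
        variants.append((label, mutant))
    return variants


if __name__ == "__main__":
    # Hypothetical toy sequence, not a real cytidine deaminase.
    for label, mutant in alanine_scan("MKAW"):
        print(label, mutant)
```

In a workflow like the one described, each variant produced this way would then be assayed, or scored by a predictive model, for on-target efficiency and off-target activity. That downstream step is not shown here because the abstract does not specify it.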