Banerjee Anupam, Bogetti Anthony T, Bahar Ivet
Laufer Center for Physical and Quantitative Biology, Stony Brook University, Stony Brook, NY 11794.
Department of Biochemistry and Cell Biology, Renaissance School of Medicine, Stony Brook University, Stony Brook, NY 11794.
Proc Natl Acad Sci U S A. 2025 May 6;122(18):e2418100122. doi: 10.1073/pnas.2418100122. Epub 2025 May 2.
Understanding the effects of missense mutations or single amino acid variants (SAVs) on protein function is crucial for elucidating the molecular basis of diseases/disorders and designing rational therapies. We introduce here a machine learning tool for discriminating pathogenic and neutral SAVs, significantly expanding on a precursor limited by the availability of structural data. With the advent of AlphaFold2 as a powerful tool for structure prediction, the tool is trained on a significantly expanded dataset of 117,525 SAVs corresponding to 12,094 human proteins reported in the ClinVar database. Adopting a broad set of descriptors composed of sequence evolutionary, structural, dynamic, and energetics features in the training algorithm, the tool achieved an AUROC of 0.94 in 10-fold cross-validation when all SAVs of a particular test protein (mutant) were excluded from the training set. Benchmarking against a variety of testing datasets demonstrated the high performance of the tool. While sequence evolutionary descriptors play a dominant role in pathogenicity prediction, those based on structural dynamics provide a mechanistic interpretation. Notably, residues involved in allosteric communication and those distinguished by pronounced fluctuations in the high-frequency modes of motion or subject to spatial constraints in soft modes usually give rise to pathogenicity when mutated. Overall, the tool provides an efficient and transparent means of accurately predicting the pathogenicity of SAVs and unraveling the mechanistic basis of the observed behavior, thus advancing our understanding of genotype-to-phenotype relations.
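The evaluation protocol described above (10-fold cross-validation in which all SAVs of a test protein are excluded from training, scored by AUROC) can be illustrated with a minimal sketch. This is not the authors' pipeline: the function names, the synthetic data, and the one-feature stand-in "model" are all hypothetical and chosen only to show how a protein-level split keeps variants of the same protein out of both sets at once.

```python
import random


def grouped_k_fold(groups, k, seed=0):
    """Yield (train_idx, test_idx) pairs; no group (protein) appears in both."""
    unique = sorted(set(groups))
    random.Random(seed).shuffle(unique)
    # Assign whole proteins, not individual variants, to folds.
    fold_of = {g: i % k for i, g in enumerate(unique)}
    for fold in range(k):
        test = [i for i, g in enumerate(groups) if fold_of[g] == fold]
        train = [i for i, g in enumerate(groups) if fold_of[g] != fold]
        yield train, test


def auroc(labels, scores):
    """AUROC as P(random positive outranks random negative); ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes are required")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def out_of_fold_scores(groups, labels, features, k=10, seed=0):
    """Score each variant using a model fit only on other proteins.

    Hypothetical stand-in model: learn from the training fold whether the
    feature is higher for pathogenic (1) or neutral (0) variants, then use
    the signed feature as the score.
    """
    scores = [None] * len(labels)
    for train, test in grouped_k_fold(groups, k, seed):
        pos = [features[i] for i in train if labels[i] == 1]
        neg = [features[i] for i in train if labels[i] == 0]
        sign = 1.0 if sum(pos) / len(pos) >= sum(neg) / len(neg) else -1.0
        for i in test:
            scores[i] = sign * features[i]
    return scores


# Synthetic toy data (assumption): 30 proteins with 2 to 6 variants each and
# one noisy feature that is shifted upward for pathogenic variants.
rng = random.Random(1)
proteins, labels, features = [], [], []
for p in range(30):
    for _ in range(rng.randint(2, 6)):
        y = rng.randint(0, 1)
        proteins.append(f"P{p:02d}")
        labels.append(y)
        features.append(rng.gauss(1.0 if y else 0.0, 1.0))

scores = out_of_fold_scores(proteins, labels, features, k=10)
pooled_auc = auroc(labels, scores)
print(f"pooled out-of-fold AUROC on toy data: {pooled_auc:.2f}")
```

Splitting by protein rather than by variant matters because variants of the same protein share sequence and structural context; a random variant-level split would let that shared context leak from training into testing and could inflate the reported AUROC.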