Dumas Noé, Portelli Geoffrey, Ji Yang, Dupont Florent, Jendoubi Mehdi, Lalli Enzo
Thales SA, Thales Services Numériques, 06560 Valbonne-Sophia Antipolis, France.
Centre National de la Recherche Scientifique, Institut de Pharmacologie Moléculaire et Cellulaire, 06560 Valbonne-Sophia Antipolis, France.
NAR Genom Bioinform. 2025 Apr 22;7(2):lqaf047. doi: 10.1093/nargab/lqaf047. eCollection 2025 Jun.
AlphaMissense is a valuable resource for discerning important functional regions within proteins: it provides pathogenicity heatmaps that highlight the pathogenic risk of specific mutations along the protein sequence. However, because of protein folding and long-range interactions, the structural alterations that carry functional consequences may occur at a distance from the mutation site. As a result, mutations that affect critical regions indirectly, from a distance, can obscure the identification of the structural regions most sensitive for protein function. In this study, we show how AlphaMissense predictions can be used to train an XGBoost regression model on structural features extracted from protein variant structures predicted by OmegaFold, which enables the definition of a new explainability metric: a residue-based importance score that highlights the most critical structural domains within a protein sequence. To verify the accuracy of our approach, we applied it to the extensively studied protein DAX-1 and successfully identified critical structural domains. Notably, because this score requires only the protein's amino acid sequence, it is valuable for guiding experimental investigations aimed at discovering functionally crucial regions in poorly characterized proteins.
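The idea of turning variant-level pathogenicity predictions into a per-residue importance score can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not the authors' implementation: the feature layout (one hypothetical structural-change value per residue per variant), the synthetic data, and above all the ranking step, which uses absolute Pearson correlation as a plainly simpler stand-in for the importances of a fitted XGBoost regression model.

```python
import random
import statistics


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either input is constant."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / (vx * vy) ** 0.5


def residue_importance(features, scores):
    """Rank residues by how strongly their structural change tracks pathogenicity.

    features[v][r]: hypothetical structural change at residue r in variant v
                    (assumed layout, e.g. derived from a predicted structure).
    scores[v]:      pathogenicity score of variant v (the regression target).
    Returns one value per residue, normalised to sum to 1.
    NOTE: correlation is a stand-in here; the source trains an XGBoost
    regression model instead.
    """
    n_res = len(features[0])
    raw = [
        abs(pearson([row[r] for row in features], scores))
        for r in range(n_res)
    ]
    total = sum(raw)
    return [x / total for x in raw] if total else raw


# Synthetic demo: pathogenicity is driven by structural change at one residue.
rng = random.Random(0)
N_VARIANTS, N_RESIDUES, CRITICAL = 200, 10, 3
features = [[rng.random() for _ in range(N_RESIDUES)] for _ in range(N_VARIANTS)]
scores = [row[CRITICAL] + 0.05 * rng.gauss(0, 1) for row in features]

importance = residue_importance(features, scores)
top = max(range(N_RESIDUES), key=importance.__getitem__)
print(f"most important residue index: {top}")
```

In this toy setting the residue whose structural change drives the score receives the largest importance, regardless of where along the sequence the mutations themselves were placed. A tree-based regressor such as XGBoost could replace the correlation step to capture non-linear and interacting effects.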