Combining evolution and protein language models for an interpretable cancer driver mutation prediction with D2Deep.

Author information

Tzavella Konstantina, Diaz Adrian, Olsen Catharina, Vranken Wim

Affiliations

Interuniversity Institute of Bioinformatics (IB2), Université Libre de Bruxelles, Vrije Universiteit Brussel (ULB-VUB), Triomflaan, Brussels 1050, Belgium.

Brussels Interuniversity Genomics High Throughput Core (BRIGHTcore), Vrije Universiteit Brussel (VUB), Université Libre de Bruxelles (ULB), Laarbeeklaan 101, Brussels 1090, Belgium.

Publication information

Brief Bioinform. 2024 Nov 22;26(1). doi: 10.1093/bib/bbae664.

DOI:10.1093/bib/bbae664

PMID:39708841

Original article link:https://pmc.ncbi.nlm.nih.gov/articles/PMC11663023/

Abstract

The mutations driving cancer are being increasingly exposed through tumor-specific genomic data. However, differentiating between cancer-causing driver mutations and random passenger mutations remains challenging. State-of-the-art homology-based predictors contain built-in biases and are often ill-suited to the intricacies of cancer biology. Protein language models have successfully addressed various biological problems but have not yet been tested on the challenging task of cancer driver mutation prediction at a large scale. Additionally, they often fail to offer result interpretation, hindering their effective use in clinical settings. The AI-based D2Deep method we introduce here addresses these challenges by combining two powerful elements: (i) a nonspecialized protein language model that captures the makeup of all protein sequences and (ii) protein-specific evolutionary information that encompasses functional requirements for a particular protein. D2Deep relies exclusively on sequence information, outperforms state-of-the-art predictors, and captures intricate epistatic changes throughout the protein caused by mutations. These epistatic changes correlate with known mutations in the clinical setting and can be used for the interpretation of results. The model is trained on a balanced, somatic training set and so effectively mitigates biases related to hotspot mutations compared to state-of-the-art techniques. The versatility of D2Deep is illustrated by its performance on non-cancer mutation prediction, where most variants still lack known consequences. D2Deep predictions and confidence scores are available via https://tumorscope.be/d2deep to help with clinical interpretation and mutation prioritization.
