Graduate Program in Quantitative and Computational Biosciences, Baylor College of Medicine, Houston, TX, 77030, USA.
Department of Bioinformatics and Computational Biology, The University of Texas MD Anderson Cancer Center, Houston, TX, 77030, USA.
Genome Biol. 2020 Feb 20;21(1):43. doi: 10.1186/s13059-020-01954-z.
The initiation and subsequent evolution of cancer are largely driven by a relatively small number of somatic mutations with critical functional impacts, so-called driver mutations. Identifying driver mutations in a patient's tumor cells is a central task in the era of precision cancer medicine. Over the past decade, many computational algorithms have been developed to predict the effects of missense single-nucleotide variants, and they are frequently employed to prioritize mutation candidates. These algorithms use diverse molecular features to build predictive models; some are cancer-specific, whereas others are not. However, the relative performance of these algorithms has not been rigorously assessed.
We construct five complementary benchmark datasets: mutation clustering patterns in protein 3D structures, literature annotation based on OncoKB, TP53 mutations based on their effects on target-gene transactivation, effects of cancer mutations on tumor formation in xenograft experiments, and functional annotation based on in vitro cell viability assays we developed, including a new dataset of ~200 mutations. We evaluate the performance of 33 algorithms and find that CHASM, CTAT-cancer, DEOGEN2, and PrimateAI show consistently better performance than the other algorithms. Moreover, cancer-specific algorithms show much better performance than those designed for a general purpose.
Our study is a comprehensive assessment of the performance of different algorithms in predicting cancer driver mutations. It provides deep insights into best practices for computationally prioritizing cancer mutation candidates, both for end-users and for the future development of new algorithms.
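The benchmarking described above, scoring many algorithms against a labeled set of mutations and comparing them, can be illustrated with a minimal sketch. This is not the authors' pipeline: the abstract does not state which metric was used, so ROC AUC is an assumption here, and the algorithm names (`algo_a`, `algo_b`), labels, and scores are made-up toy values, not results from the study.

```python
from typing import Dict, List, Sequence, Tuple


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """ROC AUC via the rank-sum (Mann-Whitney U) identity.

    labels: 1 = benchmark-positive (driver-like), 0 = benchmark-negative.
    scores: higher means "more likely a driver". Tied scores share an average rank.
    """
    if len(labels) != len(scores):
        raise ValueError("labels and scores must have the same length")
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative label")

    # Assign 1-based ranks in ascending score order, averaging over ties.
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1

    pos_rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    u_stat = pos_rank_sum - n_pos * (n_pos + 1) / 2
    return u_stat / (n_pos * n_neg)


def rank_algorithms(
    labels: Sequence[int], scores_by_algorithm: Dict[str, List[float]]
) -> List[Tuple[str, float]]:
    """Return (algorithm, AUC) pairs sorted from best to worst AUC."""
    aucs = {name: roc_auc(labels, s) for name, s in scores_by_algorithm.items()}
    return sorted(aucs.items(), key=lambda kv: kv[1], reverse=True)


# Toy benchmark: hypothetical labels and scores for illustration only.
toy_labels = [1, 1, 0, 0, 1, 0]
toy_scores = {
    "algo_a": [0.9, 0.8, 0.2, 0.1, 0.7, 0.3],
    "algo_b": [0.6, 0.4, 0.5, 0.3, 0.2, 0.7],
}

if __name__ == "__main__":
    for name, auc in rank_algorithms(toy_labels, toy_scores):
        print(f"{name}: AUC = {auc:.3f}")
```

Repeating this ranking across several independent benchmarks, rather than one, is what would allow a claim of "consistently better" performance; with a single benchmark, a high AUC may simply reflect overlap between that benchmark and an algorithm's training data.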