Center for Uterine Cancer Diagnosis and Therapy Research of Zhejiang Province, Women's Reproductive Health Key Laboratory of Zhejiang Province, Women's Hospital and Institute of Translational Medicine, Zhejiang University School of Medicine, Hangzhou, Zhejiang 310006, China.
Sir Run Run Shaw Hospital and Institute of Translational Medicine, Zhejiang University School of Medicine, Hangzhou, Zhejiang 310016, China.
Nucleic Acids Res. 2019 May 7;47(8):e45. doi: 10.1093/nar/gkz096.
Although computational approaches for prioritizing cancer driver genes have progressed rapidly, research is still far from the ultimate goal of a complete catalog of genes truly associated with cancer. Driver gene lists predicted by these computational tools lack consistency and are prone to false positives. Here, we developed an approach (DriverML) that integrates Rao's score test with supervised machine learning to identify cancer driver genes. The weight parameters in the score statistics quantify the functional impacts of mutations on the protein. To optimize these weight parameters, the score statistics of prior driver genes were maximized on pan-cancer training data. We conducted a rigorous and unbiased benchmark comparison of DriverML with 20 other existing tools on 31 independent datasets from The Cancer Genome Atlas (TCGA). Our evaluations showed that DriverML was robust and powerful across datasets and outperformed the other tools, with a better balance of precision and sensitivity. In vitro cell-based assays further supported the validity of DriverML's predictions of novel driver genes. In summary, DriverML uses a machine learning-based approach to prioritize cancer driver genes and substantially improves on existing methods. Its source code is available at https://github.com/HelloYiHan/DriverML.
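To make the idea of a weighted score statistic concrete, the following is a minimal sketch and not DriverML's actual statistic, which the abstract does not specify. It assumes that mutation counts of each type in a gene are independent Poisson variables with known background expectations, and that a single parameter theta scales each expectation by exp(w_t * theta), where w_t is a functional-impact weight. Under that assumed model, Rao's score test of theta = 0 reduces to the closed form below. The function names, the mutation types, and all numeric values are hypothetical.

```python
import math


def weighted_score_statistic(observed, expected, weights):
    """Rao score statistic for theta = 0 in an ASSUMED model where
    observed[t] ~ Poisson(expected[t] * exp(weights[t] * theta)).

    observed: mutation counts per mutation type in one gene
    expected: background expectations per type under the null
    weights:  functional-impact weights per type (hypothetical values)
    """
    if not (len(observed) == len(expected) == len(weights)):
        raise ValueError("observed, expected and weights must have equal length")
    # Score U = d(log-likelihood)/d(theta) at theta = 0
    score = sum(w * (x - e) for w, x, e in zip(weights, observed, expected))
    # Fisher information at theta = 0
    information = sum(w * w * e for w, e in zip(weights, expected))
    if information <= 0:
        raise ValueError("Fisher information must be positive")
    return score ** 2 / information


def chi2_sf_1df(stat):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(stat / 2.0))


# Hypothetical gene: counts for three illustrative mutation types
observed = [3, 1, 0]
expected = [1.0, 0.5, 0.1]
weights = [1.0, 2.0, 3.0]

stat = weighted_score_statistic(observed, expected, weights)
p_value = chi2_sf_1df(stat)
print(f"score statistic = {stat:.4f}, p-value = {p_value:.4f}")
```

In this sketch the weights are fixed inputs. The abstract describes learning them by maximizing the score statistics of known driver genes on pan-cancer training data, a step this example does not attempt.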