Patiyal Sumeet, Dhall Anjali, Raghava Gajendra P S
Department of Computational Biology, Indraprastha Institute of Information Technology, New Delhi 110020, India.
Biol Methods Protoc. 2022 May 27;7(1):bpac012. doi: 10.1093/biomethods/bpac012. eCollection 2022.
Identifying somatic mutations with high precision is one of the major challenges in predicting high-risk liver cancer patients. A number of mutation-calling techniques have been developed, including MuTect2, MuSE, Varscan2, and SomaticSniper. In this study, we benchmark the potential of these techniques for predicting prognostic biomarkers for liver cancer. We first extracted somatic mutations in liver cancer patients from Variant Call Format (VCF) and Mutation Annotation Format (MAF) files obtained from The Cancer Genome Atlas. The MAF files are 42 times smaller than the VCF files and contain only high-quality somatic mutations. We then developed machine learning-based models for predicting high-risk cancer patients using the mutations obtained from each technique. The techniques and file types were compared on their ability to discriminate high- and low-risk liver cancer patients. Based on correlation analysis, we selected 80 genes having a significant negative correlation with the overall survival of liver cancer patients. Univariate survival analysis revealed the prognostic role of highly mutated genes. Single gene-based analysis showed that the MuTect2-based MAF file achieved the maximum hazard ratio (HR) of 9.25 with a P-value of 1.78E-06. Further, we developed various prediction models using the top 10 risk-associated genes for each technique. Our results indicate that MuTect2-based VCF files outperform all other methods, with a maximum Area Under the Receiver-Operating Characteristic curve of 0.765 and HR = 4.50 (P-value = 3.83E-15). Overall, VCF files generated using MuTect2 perform better than those from the other mutation-calling techniques for the prediction of high-risk liver cancer patients.
We hope that our findings will provide a useful and comprehensive comparison of various mutation-calling techniques for the prognostic analysis of cancer patients. In order to serve the scientific community, we have provided a Python-based pipeline to develop the prediction models using mutation profiles (VCF/MAF) of cancer patients. It is available on GitHub at https://github.com/raghavagps/mutation_bench.
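The two quantitative steps described in the abstract, selecting genes whose mutation burden correlates negatively with overall survival and scoring a risk model by the Area Under the Receiver-Operating Characteristic curve, can be illustrated with a minimal sketch. This is not the authors' pipeline (see the GitHub repository for that). The gene names, mutation counts, survival times, the 20-month high-risk cutoff, and the function names below are all hypothetical. The sketch also ranks genes by the sign and size of the Pearson correlation only, whereas the study selected genes with a statistically significant negative correlation.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def select_risk_genes(mutation_counts, survival, top_k):
    """Return up to top_k genes with the most negative correlation to survival.

    mutation_counts maps gene name -> per-patient mutation counts.
    No significance test is applied here (a simplifying assumption).
    """
    scored = [(pearson(counts, survival), gene)
              for gene, counts in mutation_counts.items()]
    negative = sorted((r, gene) for r, gene in scored if r < 0)
    return [gene for _, gene in negative[:top_k]]


def auroc(scores, labels):
    """AUROC as the fraction of (high-risk, low-risk) pairs ranked correctly.

    labels: 1 = high-risk, 0 = low-risk. Tied scores count as half a win.
    """
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one patient in each risk group")
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0
               for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy cohort: six patients, overall survival in months.
toy_survival = [5, 8, 12, 30, 40, 55]
toy_counts = {
    "GENE_A": [3, 2, 2, 0, 0, 0],
    "GENE_B": [0, 0, 1, 1, 2, 3],
    "GENE_C": [1, 1, 1, 1, 1, 1],
    "GENE_D": [1, 0, 2, 1, 0, 0],
}

selected = select_risk_genes(toy_counts, toy_survival, top_k=2)

# Risk score = total mutation count across the selected genes (an assumed,
# deliberately simple model; the study used machine learning models).
risk_scores = [sum(toy_counts[g][i] for g in selected)
               for i in range(len(toy_survival))]

# Assumed cutoff: patients surviving fewer than 20 months are high-risk.
labels = [1 if months < 20 else 0 for months in toy_survival]

print(selected, auroc(risk_scores, labels))
```

The pairwise definition of AUROC used here is equivalent to the normalized Mann-Whitney U statistic, which keeps the example free of third-party dependencies. For real cohorts, a library implementation and a proper survival model would be the more appropriate choice.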