Hao Yangyang, Xuei Xiaoling, Li Lang, Nakshatri Harikrishna, Edenberg Howard J, Liu Yunlong
1 Department of Medical and Molecular Genetics, Indiana University School of Medicine, Indianapolis, Indiana.
2 Center for Computational Biology and Bioinformatics, Indiana University School of Medicine, Indianapolis, Indiana.
J Comput Biol. 2017 Jul;24(7):637-646. doi: 10.1089/cmb.2017.0057. Epub 2017 May 25.
Accurate identification of low-frequency somatic point mutations in tumor samples has important clinical utility. Although high-throughput sequencing technology can capture such variants when primary tumor samples are sequenced, our ability to detect them accurately is compromised when the variant frequency is close to the sequencer error rate. Most current experimental and bioinformatic strategies target mutations with ≥5% allele frequency, which limits our ability to understand cancer etiology and tumor evolution. We present an experimental and computational modeling framework, RareVar, to reliably identify low-frequency single-nucleotide variants from high-throughput sequencing data under standard experimental protocols. The RareVar protocol includes a benchmark design in which DNA from already sequenced individuals is pooled at various concentrations to target variants at desired frequencies, 0.5%-3% in our case. By applying a position-specific error model based on a generalized linear model, followed by machine-learning-based variant calibration, our approach outperforms existing methods. Our method can be applied on most capture and sequencing platforms without modifying the experimental protocol.
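As a rough illustration of the arithmetic behind a pooled benchmark, and of why calls near the sequencer error rate are hard, the following minimal Python sketch may help. It is not the RareVar model. The function names `expected_vaf` and `binomial_tail`, the diploid-donor assumption, and the plain binomial test against a fixed per-position error rate are all assumptions made for illustration; RareVar itself uses a generalized linear model for position-specific error followed by machine-learning-based calibration.

```python
from math import exp, lgamma, log


def expected_vaf(pool_fraction: float, alt_allele_copies: int) -> float:
    """Expected variant allele frequency contributed by one diploid donor.

    pool_fraction: share of the pooled DNA that comes from this donor (0..1).
    alt_allele_copies: 0, 1 (heterozygous) or 2 (homozygous alternate).
    Assumption: donors are diploid and DNA mixes in proportion to pool_fraction.
    """
    if not 0.0 <= pool_fraction <= 1.0:
        raise ValueError("pool_fraction must be between 0 and 1")
    if alt_allele_copies not in (0, 1, 2):
        raise ValueError("alt_allele_copies must be 0, 1 or 2")
    return pool_fraction * alt_allele_copies / 2


def binomial_tail(alt_reads: int, depth: int, error_rate: float) -> float:
    """P(X >= alt_reads) for X ~ Binomial(depth, error_rate).

    A small value suggests the observed alternate reads are unlikely to be
    sequencing errors alone. Illustrative only: a single fixed error_rate
    stands in for a fitted position-specific error model.
    """
    if not 0.0 < error_rate < 1.0:
        raise ValueError("error_rate must be strictly between 0 and 1")
    if not 0 <= alt_reads <= depth:
        raise ValueError("alt_reads must be between 0 and depth")
    log_p, log_q = log(error_rate), log(1.0 - error_rate)
    total = 0.0
    # Sum the upper tail in log space to avoid overflow at high depth.
    for k in range(alt_reads, depth + 1):
        log_pmf = (
            lgamma(depth + 1) - lgamma(k + 1) - lgamma(depth - k + 1)
            + k * log_p + (depth - k) * log_q
        )
        total += exp(log_pmf)
    return min(total, 1.0)
```

Under these assumptions, a heterozygous donor mixed in at 1% of the pool gives `expected_vaf(0.01, 1)`, an expected frequency of 0.5%, the lower end of the range targeted in the benchmark. Calling `binomial_tail` with the observed alternate read count, the depth, and an assumed error rate then indicates how easily that many alternate reads could arise from error alone.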