Department of Molecular Oncology Breast Cancer Research Program, British Columbia Cancer Research Centre, Vancouver, BC, Canada.
Bioinformatics. 2010 Mar 15;26(6):730-6. doi: 10.1093/bioinformatics/btq040. Epub 2010 Feb 3.
Next-generation sequencing (NGS) has enabled whole genome and transcriptome single nucleotide variant (SNV) discovery in cancer. NGS produces millions of short sequence reads that, once aligned to a reference genome sequence, can be interpreted for the presence of SNVs. Although tools exist for SNV discovery from NGS data, none are specifically suited to work with data from tumors, where altered ploidy and tumor cellularity impact the statistical expectations of SNV discovery.
To address this problem, we developed three implementations of a probabilistic binomial mixture model, called SNVMix, designed to infer SNVs from tumor NGS data. The first models allelic counts as observations and infers SNVs and model parameters using an expectation maximization (EM) algorithm; it can therefore adjust to the deviations in allelic frequencies inherent in genomically unstable tumor genomes. The second models the nucleotide and mapping qualities of the reads, probabilistically weighting the contribution of each read/nucleotide to the inference of an SNV according to our confidence in the base call and the read alignment. The third combines filtering of low-quality data with probabilistic weighting of the qualities. We quantitatively evaluated these approaches on 16 ovarian cancer RNASeq datasets with matched genotyping arrays and on a human breast cancer genome sequenced to >40x (haploid) coverage with ground truth data, and show systematically that the SNVMix models outperform competing approaches.
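The idea behind the first implementation, fitting a binomial mixture to allelic counts with EM, can be illustrated with a minimal sketch. This is not the SNVMix software: it is plain maximum-likelihood EM over three assumed genotype states (aa, ab, bb), and it omits any priors, quality weighting, or filtering the published models may use. The function names, starting values, and toy counts are hypothetical and chosen only for illustration.

```python
import math


def posteriors(ref_counts, depths, mu, pi):
    """E-step: P(genotype state | counts) per position, computed in log space."""
    result = []
    for a, n in zip(ref_counts, depths):
        # The binomial coefficient is the same for every state, so it cancels.
        log_w = [
            math.log(p) + a * math.log(m) + (n - a) * math.log(1.0 - m)
            for m, p in zip(mu, pi)
        ]
        top = max(log_w)
        w = [math.exp(v - top) for v in log_w]
        total = sum(w)
        result.append([v / total for v in w])
    return result


def fit_binomial_mixture(ref_counts, depths,
                         mu=(0.99, 0.5, 0.01), pi=(0.8, 0.1, 0.1), n_iter=50):
    """Fit a 3-state binomial mixture by EM (assumed starting values).

    ref_counts[i]: reads matching the reference allele at position i
    depths[i]:     total reads covering position i
    mu[k]:         probability of seeing the reference allele in state k
    pi[k]:         mixing proportion of state k (aa, ab, bb)
    """
    mu, pi = list(mu), list(pi)
    eps = 1e-9  # keeps parameters away from 0 and 1 so the logs stay finite
    for _ in range(n_iter):
        resp = posteriors(ref_counts, depths, mu, pi)
        # M-step: re-estimate each state's parameters from the soft assignments.
        for k in range(len(mu)):
            weight = sum(r[k] for r in resp)
            ref_mass = sum(r[k] * a for r, a in zip(resp, ref_counts))
            depth_mass = sum(r[k] * n for r, n in zip(resp, depths))
            pi[k] = max(weight / len(resp), eps)
            if depth_mass > 0:
                mu[k] = min(max(ref_mass / depth_mass, eps), 1.0 - eps)
    return mu, pi, posteriors(ref_counts, depths, mu, pi)


def snv_probability(resp_row):
    """Posterior probability that a position is not homozygous reference."""
    return resp_row[1] + resp_row[2]


# Toy data: depth 40 everywhere; the heterozygous-like positions sit near
# 70% reference reads rather than 50%, mimicking a skewed tumor sample.
ref_counts = [40, 39, 40, 40, 28, 27, 29, 0, 1]
depths = [40] * len(ref_counts)
mu, pi, resp = fit_binomial_mixture(ref_counts, depths)
for a, row in zip(ref_counts, resp):
    print(a, round(snv_probability(row), 4))
```

Because `mu` is re-estimated from the data rather than fixed at 0.5 for the heterozygous state, the fitted value can move toward the skewed allelic fraction in the sample, which is the kind of adjustment the abstract attributes to the EM-based model.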
Software and data are available at http://compbio.bccrc.ca
Contact: sshah@bccrc.ca. Supplementary information: Supplementary data are available at Bioinformatics online.