Center for Biomedical Informatics, Harvard Medical School, Boston, MA 02115, USA.
Proc Natl Acad Sci U S A. 2011 Nov 15;108(46):E1128-36. doi: 10.1073/pnas.1110574108. Epub 2011 Nov 7.
DNA copy number variations (CNVs) play an important role in the pathogenesis and progression of cancer and confer susceptibility to a variety of human disorders. Array comparative genomic hybridization has been widely used to identify CNVs genome-wide, but next-generation sequencing provides an opportunity to characterize CNVs genome-wide with unprecedented resolution. In this study, we developed an algorithm to detect CNVs from whole-genome sequencing data and applied it to a newly sequenced glioblastoma genome with a matched control. This read-depth algorithm, called BIC-seq, can accurately and efficiently identify CNVs by minimizing the Bayesian information criterion. Using BIC-seq, we identified hundreds of CNVs as small as 40 bp in the cancer genome sequenced at 10× coverage, whereas we could only detect large CNVs (> 15 kb) in the array comparative genomic hybridization profiles for the same genome. Eighty percent (14/16) of the small variants tested (110 bp to 14 kb) were experimentally validated by quantitative PCR, demonstrating the high sensitivity and true positive rate of the algorithm. We also extended the algorithm to detect recurrent CNVs in multiple samples and to derive error bars for breakpoints using a Gibbs sampling approach. We propose this statistical approach as a principled yet practical and efficient method to estimate CNVs in whole-genome sequencing data.
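The abstract describes a read-depth method that finds CNVs by minimizing the Bayesian information criterion, but it does not spell out the procedure. The following is a minimal sketch of one way such a segmentation could work, not the published BIC-seq implementation. It assumes read counts are already tallied per bin for the tumor and the matched control, models each segment with a Bernoulli likelihood (a read comes from the tumor or the control), and greedily merges adjacent segments while a merge lowers the BIC. The names `seg_loglik` and `bic_segment` and the penalty weight `lam` are hypothetical and chosen for illustration only.

```python
import math


def seg_loglik(t, n):
    """Bernoulli log-likelihood of a segment with t tumor and n control reads."""
    total = t + n
    ll = 0.0
    if t > 0:
        ll += t * math.log(t / total)
    if n > 0:
        ll += n * math.log(n / total)
    return ll


def bic_segment(tumor, control, lam=1.0):
    """Greedy bottom-up segmentation of per-bin read counts (illustrative only).

    tumor, control: equal-length lists of read counts per bin.
    lam: assumed weight on the BIC penalty; larger values give fewer segments.
    Returns a list of (start_bin, end_bin_exclusive, tumor_reads, control_reads).
    """
    if len(tumor) != len(control):
        raise ValueError("tumor and control must have the same number of bins")
    total_reads = sum(tumor) + sum(control)
    if total_reads == 0:
        raise ValueError("no reads to segment")

    # Start with every bin as its own segment.
    segs = [[i, i + 1, t, n] for i, (t, n) in enumerate(zip(tumor, control))]
    # Each segment costs one parameter, penalized by lam * log(total reads).
    penalty = lam * math.log(total_reads)

    while len(segs) > 1:
        best_delta, best_i = 0.0, None
        for i in range(len(segs) - 1):
            a, b = segs[i], segs[i + 1]
            merged_ll = seg_loglik(a[2] + b[2], a[3] + b[3])
            parts_ll = seg_loglik(a[2], a[3]) + seg_loglik(b[2], b[3])
            # Change in BIC if a and b are merged: fit gets worse, penalty drops.
            delta = -2.0 * (merged_ll - parts_ll) - penalty
            if delta < best_delta:
                best_delta, best_i = delta, i
        if best_i is None:
            break  # no merge lowers the BIC any further
        a, b = segs[best_i], segs[best_i + 1]
        segs[best_i:best_i + 2] = [[a[0], b[1], a[2] + b[2], a[3] + b[3]]]

    return [tuple(s) for s in segs]


# Toy input: a gained region (bins 5-9) flanked by copy-neutral bins.
tumor_counts = [10] * 5 + [30] * 5 + [10] * 5
control_counts = [10] * 15
segments = bic_segment(tumor_counts, control_counts)
```

On this toy input the sketch returns three segments covering bins 0-5, 5-10, and 10-15, with the middle segment carrying three times as many tumor reads as control reads. Real data would additionally need read mapping, bin-size selection, and a tuned penalty, none of which are covered here.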