Biostatistics Branch, Division of Cancer Epidemiology and Genetics, National Cancer Institute, National Institutes of Health, Bethesda, MD, 20892, USA.
BMC Bioinformatics. 2018 Nov 14;19(1):424. doi: 10.1186/s12859-018-2412-y.
Somatic copy number alteration (SCNA) is a common feature of the cancer genome and is associated with cancer etiology and prognosis. Allele-specific SCNA analysis of a tumor sample aims to identify the copy numbers of both alleles, adjusting for ploidy and tumor purity. Next-generation sequencing platforms produce abundant read counts at base-pair resolution across the exome or whole genome, and analyses of such data are susceptible to hypersegmentation, a phenomenon in which numerous very short regions are falsely identified as SCNAs.
We propose hsegHMM, a hidden Markov model approach to allele-specific SCNA analysis that accounts for hypersegmentation. hsegHMM provides statistical inference of copy number profiles using an efficient expectation-maximization (E-M) algorithm. Through simulation and application studies, we found that hsegHMM handles hypersegmentation effectively by using a t-distribution as part of the emission probability distribution and a carefully defined state space. We also compared hsegHMM with FACETS, a current method for allele-specific SCNA analysis. For the application, we used a renal cell carcinoma sample from The Cancer Genome Atlas (TCGA) study.
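To illustrate why a heavy-tailed emission distribution can curb hypersegmentation, the following is a minimal sketch and not the hsegHMM implementation. It assumes a toy three-state HMM (loss, neutral, gain) over a single log-ratio track, decoded with the Viterbi algorithm rather than fitted by E-M. The state means, the scale, the degrees of freedom, the stay probability, and all function names are hypothetical choices made only for this example.

```python
import math

def normal_logpdf(x, loc, scale):
    """Log density of a Gaussian with the given location and scale."""
    z = (x - loc) / scale
    return -0.5 * math.log(2 * math.pi) - math.log(scale) - 0.5 * z * z

def t_logpdf(x, loc, scale, df):
    """Log density of a location-scale Student t-distribution."""
    z = (x - loc) / scale
    return (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
            - 0.5 * math.log(df * math.pi) - math.log(scale)
            - (df + 1) / 2 * math.log1p(z * z / df))

def viterbi(obs, means, log_emission, stay_prob):
    """Most likely state path for an HMM with a uniform initial distribution
    and a transition matrix that stays put with probability stay_prob."""
    k = len(means)
    log_stay = math.log(stay_prob)
    log_move = math.log((1 - stay_prob) / (k - 1))
    score = [-math.log(k) + log_emission(obs[0], m) for m in means]
    back = []
    for x in obs[1:]:
        new_score, pointers = [], []
        for j in range(k):
            cands = [score[i] + (log_stay if i == j else log_move)
                     for i in range(k)]
            best = max(range(k), key=lambda i: cands[i])
            pointers.append(best)
            new_score.append(cands[best] + log_emission(x, means[j]))
        score = new_score
        back.append(pointers)
    # Trace back from the best final state.
    state = max(range(k), key=lambda i: score[i])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path

def count_segments(path):
    """Number of runs of identical consecutive states."""
    return 1 + sum(a != b for a, b in zip(path, path[1:]))

# Hypothetical log-ratio means for loss, neutral, and gain states.
MEANS = [-0.5, 0.0, 0.5]
SCALE, DF, STAY = 0.1, 3, 0.99

def gaussian_emission(x, mean):
    return normal_logpdf(x, mean, SCALE)

def t_emission(x, mean):
    return t_logpdf(x, mean, SCALE, DF)

# A copy-neutral stretch with one isolated noisy observation (0.90).
spike = [0.02, -0.03, 0.01, 0.00, -0.02, 0.90, 0.03, -0.01, 0.02, -0.02, 0.01]

gaussian_path = viterbi(spike, MEANS, gaussian_emission, STAY)
t_path = viterbi(spike, MEANS, t_emission, STAY)
print("Gaussian segments:", count_segments(gaussian_path))
print("t segments:", count_segments(t_path))
```

In this toy input, the Gaussian emission penalizes the single outlier so heavily that the decoded path jumps into the gain state for one position, creating a spurious short segment, whereas the t emission absorbs the outlier and keeps one neutral segment. A sustained shift of several consecutive observations would still outweigh the transition cost under the t emission, so genuine alterations remain detectable in this sketch.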
We demonstrate that hsegHMM is robust to hypersegmentation. Furthermore, hsegHMM quantifies the uncertainty in identifying allele-specific SCNAs across entire chromosomes. hsegHMM performs better than FACETS when read depth (coverage) is uneven across the genome.