Guha Subharup, Li Yi, Neuberg Donna
Department of Statistics, University of Missouri-Columbia, Columbia, MO 65211.
J Am Stat Assoc. 2008 Jun 1;103(482):485-497. doi: 10.1198/016214507000000923.
Genomic alterations have been linked to the development and progression of cancer. The technique of comparative genomic hybridization (CGH) yields data consisting of fluorescence intensity ratios of test and reference DNA samples. The intensity ratios provide information about DNA copy number. Practical issues such as the contamination of tumor cells in tissue specimens and normalization errors necessitate the use of statistics for learning about the genomic alterations from array CGH data. As increasing amounts of array CGH data become available, there is a growing need for automated algorithms for characterizing genomic profiles. Specifically, there is a need for algorithms that can identify gains and losses in copy number based on statistical considerations, rather than merely detect trends in the data. We adopt a Bayesian approach, relying on the hidden Markov model to account for the inherent dependence in the intensity ratios. Posterior inferences are made about gains and losses in copy number. Localized amplifications (associated with oncogene mutations) and deletions (associated with mutations of tumor suppressors) are identified using posterior probabilities. Global trends such as extended regions of altered copy number are detected. Because the posterior distribution is analytically intractable, we implement a Metropolis-within-Gibbs algorithm for efficient simulation-based inference. Publicly available data on pancreatic adenocarcinoma, glioblastoma multiforme, and breast cancer are analyzed, and comparisons are made with some widely used algorithms to illustrate the reliability and success of the technique.
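The idea of using a hidden Markov model to turn dependent intensity ratios into posterior probabilities of gain and loss can be illustrated with a small sketch. This is not the authors' procedure: the abstract describes a fully Bayesian model fitted by Metropolis-within-Gibbs, whereas the sketch below fixes all parameters and runs the standard forward-backward recursion. The three states, the state means (idealized log2 ratios for one, two, and three copies), the common standard deviation, the transition matrix, and the toy data are all assumptions chosen for illustration.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def posterior_state_probs(log_ratios, means, sds, trans, init):
    """Scaled forward-backward: P(state_t = j | all data) for every probe t."""
    n, k = len(log_ratios), len(means)
    emit = [[normal_pdf(y, means[j], sds[j]) for j in range(k)]
            for y in log_ratios]

    # Forward pass, normalizing each step to avoid numerical underflow.
    alpha, scales = [], []
    for t in range(n):
        if t == 0:
            row = [init[j] * emit[0][j] for j in range(k)]
        else:
            prev = alpha[-1]
            row = [emit[t][j] * sum(prev[i] * trans[i][j] for i in range(k))
                   for j in range(k)]
        c = sum(row)
        alpha.append([v / c for v in row])
        scales.append(c)

    # Backward pass, reusing the forward scale factors.
    beta = [[1.0] * k for _ in range(n)]
    for t in range(n - 2, -1, -1):
        for i in range(k):
            beta[t][i] = sum(trans[i][j] * emit[t + 1][j] * beta[t + 1][j]
                             for j in range(k)) / scales[t + 1]

    # Combine and renormalize to get marginal posteriors per probe.
    post = []
    for t in range(n):
        row = [alpha[t][j] * beta[t][j] for j in range(k)]
        s = sum(row)
        post.append([v / s for v in row])
    return post


# Hypothetical setup: all values below are illustrative assumptions.
STATES = ["loss", "neutral", "gain"]
means = [-1.0, 0.0, 0.58]        # idealized log2(1/2), log2(2/2), log2(3/2)
sds = [0.2, 0.2, 0.2]            # assumed common measurement noise
trans = [[0.90, 0.05, 0.05],     # "sticky" transitions encode dependence
         [0.05, 0.90, 0.05],     # between neighboring probes
         [0.05, 0.05, 0.90]]
init = [1.0 / 3.0] * 3

# Toy log2 intensity ratios ordered along a chromosome (made-up numbers).
log_ratios = [0.02, -0.05, 0.61, 0.55, 0.60, 0.01, -0.04, -0.95, -1.05, 0.03]

post = posterior_state_probs(log_ratios, means, sds, trans, init)
calls = [STATES[max(range(3), key=lambda j: row[j])] for row in post]

for y, row, call in zip(log_ratios, post, calls):
    print(f"{y:6.2f}  " + "  ".join(f"{p:.3f}" for p in row) + f"  {call}")
```

Each printed row gives a probe's log2 ratio, its posterior probabilities of loss, neutral, and gain, and the most probable state. In real tumor samples the state means would be pulled toward zero by contamination with normal cells, which is one reason the parameters are better treated as unknown and inferred, as in the Bayesian approach the abstract describes.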