Integrating copy number polymorphisms into array CGH analysis using a robust HMM.
Author information
Shah Sohrab P, Xuan Xiang, DeLeeuw Ron J, Khojasteh Mehrnoush, Lam Wan L, Ng Raymond, Murphy Kevin P
Affiliation
Department of Computer Science, University of British Columbia, 201-2366 Main Mall, Vancouver, BC V6T 1Z4, Canada.
Publication information
Bioinformatics. 2006 Jul 15;22(14):e431-9. doi: 10.1093/bioinformatics/btl238.
MOTIVATION
Array comparative genomic hybridization (aCGH) is a pervasive technique used to identify chromosomal aberrations in human diseases, including cancer. Aberrations are defined as regions of increased or decreased DNA copy number relative to a normal sample. Accurately identifying the locations of these aberrations has many important medical applications. Unfortunately, the observed copy number changes are often corrupted by various sources of noise, making the boundaries hard to detect. One popular technique uses hidden Markov models (HMMs) to divide the signal into regions of constant copy number called segments; a subsequent classification phase labels each segment as a gain, a loss or neutral. However, standard HMMs are sensitive to outliers, which causes over-segmentation: the model erroneously produces segments that span very short regions.
RESULTS
We propose a simple modification that makes the HMM robust to such outliers. More importantly, this modification allows us to exploit prior knowledge about the likely location of "outliers", which are often due to copy number polymorphisms (CNPs). By "explaining away" these outliers with prior knowledge about the locations of CNPs, we can focus attention on the more clinically relevant aberrant regions. We show significant improvements over the current state-of-the-art technique (DNAcopy with MergeLevels) on previously published data from mantle cell lymphoma cell lines, and on published benchmark synthetic data augmented with outliers.
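The idea of "explaining away" outliers can be illustrated with a minimal sketch. This is not the authors' Matlab implementation or their exact model. It assumes a three-state HMM (loss, neutral, gain) with Gaussian emissions, and it models an outlier as a broad, state-independent Gaussian that is mixed into each probe's emission with a location-specific prior probability. All names (`emission`, `viterbi`, `cnp_prior`), the state means, the standard deviations, the transition probability and the toy data are hypothetical choices made for illustration only.

```python
import math

STATES = ("loss", "neutral", "gain")
# Assumed log2-ratio level of each state (illustrative values only).
MEANS = {"loss": -0.5, "neutral": 0.0, "gain": 0.5}


def normal_pdf(x, mean, sd):
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def emission(y, state, p_outlier, sd=0.1, outlier_sd=1.0):
    """Mixture of a state-specific Gaussian and a broad outlier Gaussian.

    p_outlier is the prior probability that this probe is an outlier,
    e.g. raised at probes that overlap a known CNP (an assumption here).
    """
    inlier = normal_pdf(y, MEANS[state], sd)
    outlier = normal_pdf(y, 0.0, outlier_sd)
    return (1.0 - p_outlier) * inlier + p_outlier * outlier


def log_emission(y, state, p_outlier):
    # Floor the density to avoid log(0) on extreme values.
    return math.log(max(emission(y, state, p_outlier), 1e-300))


def viterbi(ys, outlier_priors, stay=0.99):
    """Most likely state path; outlier_priors holds one prior per probe."""
    switch = (1.0 - stay) / (len(STATES) - 1)

    def log_trans(a, b):
        return math.log(stay if a == b else switch)

    score = {s: math.log(1.0 / len(STATES))
             + log_emission(ys[0], s, outlier_priors[0]) for s in STATES}
    back = []
    for y, p in zip(ys[1:], outlier_priors[1:]):
        new_score, pointers = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda r: score[r] + log_trans(r, s))
            pointers[s] = prev
            new_score[s] = score[prev] + log_trans(prev, s) + log_emission(y, s, p)
        back.append(pointers)
        score = new_score

    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


# Toy data: neutral probes, one outlying probe (index 5), then a real gain.
ys = [0.02, -0.03, 0.01, 0.00, -0.02, 0.60, 0.01, -0.01, 0.03, 0.00,
      0.52, 0.48, 0.50, 0.51, 0.49]

# Standard HMM: no outlier component at any probe.
standard = viterbi(ys, [0.0] * len(ys))

# Robust HMM: small outlier prior everywhere, higher at a hypothetical CNP probe.
cnp_prior = [0.01] * len(ys)
cnp_prior[5] = 0.5
robust = viterbi(ys, cnp_prior)

print("standard:", standard)
print("robust:  ", robust)
```

With these toy settings, the standard run assigns the single probe at index 5 its own one-probe "gain" segment, which is the over-segmentation described above, while the robust run keeps that probe in the neutral segment and still calls the five-probe gain at the end.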
AVAILABILITY
Source code written in Matlab is available from http://www.cs.ubc.ca/~sshah/acgh.