A continuous-index hidden Markov jump process for modeling DNA copy number data.

Author information

Stjernqvist Susann, Rydén Tobias

Affiliation

Centre for Mathematical Sciences, Lund University, Box 118, 22100 Lund, Sweden.

Publication information

Biostatistics. 2009 Oct;10(4):773-8. doi: 10.1093/biostatistics/kxp030. Epub 2009 Jul 23.

DOI:10.1093/biostatistics/kxp030

PMID:19628640

Abstract

The number of copies of DNA in human cells can be measured using array comparative genomic hybridization (aCGH), which provides intensity ratios of sample to reference DNA at genomic locations corresponding to probes on a microarray. In the present paper, we devise a statistical model, based on a latent continuous-index Markov jump process, that is designed to capture certain features of aCGH data, including probes that are of uneven length, unevenly spaced, and overlapping. The model has a discrete state space, with 1 state representing a normal copy number of 2, and the rest of the states being either amplifications or deletions. We adopt a Bayesian approach and apply Markov chain Monte Carlo (MCMC) methods for estimating the parameters and the Markov process. The model can be applied to data from both tiling bacterial artificial chromosome arrays and oligonucleotide arrays. We also compare a model with normally distributed noise to a model with t-distributed noise, showing that the latter is more robust to outliers.
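
To make the data-generating idea in the abstract concrete, the following is a minimal simulation sketch, not the authors' implementation. It draws a piecewise-constant latent jump process along a genomic coordinate, places probes of uneven length and spacing (which may overlap), and adds either normal or t-distributed noise. Everything specific here is an assumption made for illustration: the three states and their log-ratio levels, the jump rates and jump probabilities, the choice to take each probe's mean as the length-weighted average of the latent level over the probe, the noise scale, and the function names `simulate_jump_process` and `probe_means`. The Bayesian MCMC estimation step described in the abstract is not shown.

```python
import numpy as np

def simulate_jump_process(length, rates, jump_probs, start_state, rng):
    """Simulate a piecewise-constant Markov jump process on [0, length].

    Returns the segment breakpoints (starting at 0, ending at length)
    and the state occupied on each segment.
    """
    breakpoints = [0.0]
    states = [start_state]
    pos = 0.0
    while True:
        state = states[-1]
        # Holding distance in the current state is exponential with the state's rate.
        pos += rng.exponential(1.0 / rates[state])
        if pos >= length:
            break
        breakpoints.append(pos)
        states.append(int(rng.choice(len(rates), p=jump_probs[state])))
    breakpoints.append(length)
    return np.array(breakpoints), np.array(states)

def probe_means(breakpoints, states, levels, starts, ends):
    """Length-weighted average of the latent level over each probe [start, end].

    This averaging rule is an illustrative assumption.
    """
    seg_levels = levels[states]
    means = np.empty(len(starts))
    for i, (a, b) in enumerate(zip(starts, ends)):
        lo = np.maximum(breakpoints[:-1], a)
        hi = np.minimum(breakpoints[1:], b)
        overlap = np.clip(hi - lo, 0.0, None)  # overlap of each segment with the probe
        means[i] = np.dot(overlap, seg_levels) / (b - a)
    return means

rng = np.random.default_rng(0)

# Hypothetical setup: state 1 is the normal copy number of 2,
# state 0 a deletion, state 2 an amplification (levels on a log-ratio scale).
levels = np.array([-0.5, 0.0, 0.5])
rates = np.array([2.0, 1.0, 2.0])
jump_probs = np.array([[0.0, 1.0, 0.0],
                       [0.5, 0.0, 0.5],
                       [0.0, 1.0, 0.0]])
length = 10.0

breakpoints, states = simulate_jump_process(length, rates, jump_probs, 1, rng)

# Probes with uneven spacing and uneven length; neighbours may overlap.
n_probes = 50
starts = np.sort(rng.uniform(0.0, length - 0.5, n_probes))
ends = starts + rng.uniform(0.05, 0.5, n_probes)

means = probe_means(breakpoints, states, levels, starts, ends)

noise_scale = 0.1
y_normal = means + noise_scale * rng.standard_normal(n_probes)
y_t = means + noise_scale * rng.standard_t(df=3, size=n_probes)  # heavier tails
```

The t-distributed variant produces occasional large deviations from the probe means, which is the kind of outlier behaviour the abstract's robustness comparison concerns.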
