Wang Wenyi, Carvalho Benilton, Miller Nathaniel D, Pevsner Jonathan, Chakravarti Aravinda, Irizarry Rafael A
Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, Baltimore, Maryland, USA.
J Comput Biol. 2008 Sep;15(7):857-66. doi: 10.1089/cmb.2007.0148.
Genomic changes such as copy number alterations are among the major underlying causes of human phenotypic variation in normal and disease subjects. Array comparative genomic hybridization (CGH) technology was developed to detect copy number changes in a high-throughput fashion. However, this technology provides only a >30-kb resolution, which limits the ability to detect copy number alterations spanning small regions. Higher resolution technologies such as single nucleotide polymorphism (SNP) microarrays allow detection of copy number alterations as small as several thousand base pairs. Unfortunately, strong probe effects and variation introduced by sample preparation procedures have made single-point copy number estimates too imprecise to be useful. Various groups have proposed statistical procedures that pool data from neighboring locations to improve precision. However, these procedures need to average across relatively large regions to work effectively, thus greatly reducing resolution. Recently, regression-type models that account for probe effects have been proposed and appear to improve accuracy as well as precision. In this paper, we propose a mixture model solution, specifically designed for single-point estimation, that provides various advantages over the existing methodology. We use a 314-sample database to motivate and fit models for the conditional distribution of the observed intensities given allele-specific copy number. We can then compute posterior probabilities that provide a useful prediction rule as well as a confidence measure for each call. Software to implement this procedure will be available in the Bioconductor oligo package (www.bioconductor.org).
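The posterior-probability step can be illustrated with a minimal sketch. This is not the authors' model or the oligo package API. It assumes a single log2 intensity per SNP, Gaussian conditional densities, and made-up means, standard deviations, and prior weights for three total copy number states, whereas the paper fits allele-specific conditional distributions from the 314-sample database. All names below (`posterior_copy_number`, `call_copy_number`, and the parameter arrays) are hypothetical.

```python
import numpy as np

# Hypothetical parameters, for illustration only (not taken from the paper):
# each copy number state gets a Gaussian density on the log2 intensity scale.
COPY_NUMBERS = np.array([1, 2, 3])
MEANS = np.array([9.0, 10.0, 10.6])     # assumed mean log2 intensity per state
SDS = np.array([0.3, 0.3, 0.3])         # assumed standard deviation per state
PRIORS = np.array([0.05, 0.90, 0.05])   # assumed prior weight per state


def posterior_copy_number(log_intensity):
    """Posterior probability of each state given intensity, via Bayes' rule."""
    # Add a trailing axis so each observation broadcasts against all states.
    x = np.asarray(log_intensity, dtype=float)[..., None]
    # Gaussian log-density of the observation under each state.
    log_lik = -0.5 * ((x - MEANS) / SDS) ** 2 - np.log(SDS * np.sqrt(2 * np.pi))
    log_post = log_lik + np.log(PRIORS)
    # Subtract the maximum before exponentiating for numerical stability.
    log_post -= log_post.max(axis=-1, keepdims=True)
    post = np.exp(log_post)
    return post / post.sum(axis=-1, keepdims=True)


def call_copy_number(log_intensity):
    """Return the most probable state and its posterior as a confidence."""
    post = posterior_copy_number(log_intensity)
    best = post.argmax(axis=-1)
    return COPY_NUMBERS[best], post.max(axis=-1)
```

Calling `call_copy_number([8.5, 10.0, 11.2])` returns one predicted copy number and one confidence value per observation. The prediction rule picks the state with the largest posterior, and that posterior doubles as the confidence measure for the call.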