Department of Statistics, Stanford University, Stanford, CA, USA.
Bioinformatics. 2010 Jan 15;26(2):153-60. doi: 10.1093/bioinformatics/btp653. Epub 2009 Nov 20.
DNA copy number variants (CNVs) are gains and losses of chromosomal segments and comprise an important class of genetic variation. Recently, various microarray hybridization-based techniques have been developed for high-throughput measurement of DNA copy number. In many studies, multiple technical platforms or different versions of the same platform were used to interrogate the same samples, and it became necessary to pool information across these sources to derive a consensus molecular profile for each sample. An integrated analysis is expected to maximize resolution and accuracy, yet there is currently no well-formulated statistical method that addresses the between-platform differences in probe coverage, assay methods, sensitivity and analytical complexity.
The conventional approach is to apply one of the CNV detection ('segmentation') algorithms to search for DNA segments of altered signal intensity, and to combine the results from multiple platforms only after segmentation. Here we propose a new method, Multi-Platform Circular Binary Segmentation (MPCBS), which pools statistical evidence across platforms during segmentation and does not require pre-standardization of the different data sources. It uses a weighted sum of t-statistics across platforms, which arises naturally from the generalized log-likelihood ratio of a multi-platform model. By comparing the integrated analysis of Affymetrix and Illumina SNP array data with Agilent and fosmid clone end-sequencing results on eight HapMap samples, we show that MPCBS achieves improved spatial resolution and detection power and provides a natural consensus across platforms. We also apply the new method to multi-platform data for tumor samples.
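As a rough illustration of how evidence can be pooled during segmentation, the sketch below computes a t-statistic per platform for one candidate segment and combines them into a weighted sum. It is not the MPCBS package's code: the function names, the dict layout, the `response` ratio and the assumption that each platform's noise standard deviation `sigma` is known are all hypothetical choices made for this example. The weights follow from a simplified Gaussian model in which a true change shifts the mean on platform k by `response_k * mu`.

```python
import numpy as np


def platform_t_stat(pos, y, start, end, sigma):
    """Mean-shift statistic for one platform on the segment [start, end).

    Returns (x, info): x is the standardized difference between the mean
    intensity inside and outside the segment, and info = 1 / (standard
    error of that difference). sigma is assumed known here; in practice
    it would have to be estimated from the data.
    """
    pos = np.asarray(pos)
    y = np.asarray(y, dtype=float)
    inside = (pos >= start) & (pos < end)
    m, n = int(inside.sum()), len(y)
    if m == 0 or m == n:
        # No probes inside (or none outside): the platform is uninformative.
        return 0.0, 0.0
    scale = sigma * np.sqrt(1.0 / m + 1.0 / (n - m))
    x = (y[inside].mean() - y[~inside].mean()) / scale
    return float(x), float(1.0 / scale)


def pooled_statistic(platforms, start, end):
    """Weighted sum of per-platform statistics for one candidate segment.

    Each platform is a dict with keys "pos", "y", "sigma" and an optional
    hypothetical "response" ratio (default 1.0). Under the assumed model
    E[x_k] = mu * c_k with c_k = response_k * info_k, and maximizing the
    Gaussian likelihood over mu gives Z = sum(c_k * x_k) / sqrt(sum(c_k^2)).
    """
    xs, cs = [], []
    for p in platforms:
        x, info = platform_t_stat(p["pos"], p["y"], start, end, p["sigma"])
        xs.append(x)
        cs.append(p.get("response", 1.0) * info)
    xs, cs = np.array(xs), np.array(cs)
    norm = np.sqrt((cs ** 2).sum())
    if norm == 0.0:
        return 0.0
    return float((cs * xs).sum() / norm)
```

Under these assumptions, a platform with no probes inside the candidate segment receives zero weight, and two platforms carrying identical data give a pooled statistic that is sqrt(2) times the single-platform value. A full segmentation method would additionally scan over candidate segments and calibrate a significance threshold, which this sketch omits.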
The R package for MPCBS is registered on R-Forge (http://r-forge.r-project.org/) under project name MPCBS.
Supplementary data are available at Bioinformatics online.