Department of Biostatistics, Harvard University, Boston, MA 02115, USA.
Biostatistics. 2011 Oct;12(4):776-91. doi: 10.1093/biostatistics/kxr012. Epub 2011 Jun 3.
Array-based comparative genomic hybridization (aCGH) enables the measurement of DNA copy number across thousands of locations in a genome. The main goals of analyzing aCGH data are to identify the regions of copy number variation (CNV) and to quantify the amount of CNV. Although there are many methods for analyzing single-sample aCGH data, the analysis of multi-sample aCGH data is a relatively new area of research. Further, many of the current approaches for analyzing multi-sample aCGH data do not appropriately utilize the additional information present in the multiple samples. We propose a procedure called the Fused Lasso Latent Feature Model (FLLat) that provides a statistical framework for modeling multi-sample aCGH data and identifying regions of CNV. The procedure models each sample of aCGH data as a weighted sum of a fixed number of features. Regions of CNV are then identified by applying the fused lasso penalty to each feature. Simulation analyses show that FLLat outperforms single-sample methods when the simulated samples share common information. We also propose a method for estimating the false discovery rate. An analysis of an aCGH data set obtained from human breast tumors, focusing on chromosomes 8 and 17, shows that FLLat and Significance Testing of Aberrant Copy number (an alternative, existing approach) identify similar regions of CNV that are consistent with previous findings. However, through the estimated features and their corresponding weights, FLLat is further able to discern specific relationships between the samples, for example, identifying 3 distinct groups of samples based on their patterns of CNV for chromosome 17.
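The two ingredients described in the abstract, a weighted sum of features and a fused lasso penalty on each feature, can be illustrated with a minimal sketch. The abstract does not state the objective function, so everything below is an assumption made for illustration: the matrix names `Y`, `B`, and `Theta`, the tuning parameters `lam1` and `lam2`, the squared-error fit term, and the function names are hypothetical and are not taken from the FLLat software. The sketch only evaluates such an objective on toy data; it does not fit the model.

```python
import numpy as np


def fused_lasso_penalty(feature, lam1, lam2):
    """Fused lasso penalty for one feature (one column of B).

    lam1 weights the absolute values (encourages zeros, i.e. no CNV);
    lam2 weights absolute differences between adjacent probes
    (encourages piecewise-constant segments).
    """
    sparsity = lam1 * np.sum(np.abs(feature))
    smoothness = lam2 * np.sum(np.abs(np.diff(feature)))
    return sparsity + smoothness


def fllat_style_objective(Y, B, Theta, lam1, lam2):
    """Squared reconstruction error plus a fused lasso penalty per feature.

    Assumed shapes: Y is (probes x samples), B is (probes x features),
    Theta is (features x samples), so each sample (column of Y) is
    approximated by a weighted sum of the columns of B.
    """
    residual = Y - B @ Theta
    fit = np.sum(residual ** 2)
    penalty = sum(
        fused_lasso_penalty(B[:, j], lam1, lam2) for j in range(B.shape[1])
    )
    return fit + penalty


# Toy data: 6 probes, 2 features (one gain segment, one loss segment).
B = np.array(
    [[0.0, 0.0],
     [1.0, 0.0],
     [1.0, 0.0],
     [0.0, -1.0],
     [0.0, -1.0],
     [0.0, 0.0]]
)
# Weights for 3 samples: sample 1 carries only the gain, sample 2 only
# the loss, sample 3 a half-strength mixture of both.
Theta = np.array([[1.0, 0.0, 0.5],
                  [0.0, 1.0, 0.5]])
Y = B @ Theta  # noise-free samples, so the fit term is exactly zero

objective_value = fllat_style_objective(Y, B, Theta, lam1=0.1, lam2=0.5)
print(objective_value)
```

In this toy setting the nonzero runs in each column of `B` play the role of CNV regions, and the columns of `Theta` show how samples could be grouped by which features they carry.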