Venkatraman E S, Olshen Adam B
Department of Epidemiology and Biostatistics, Memorial Sloan-Kettering Cancer Center, 1275 York Avenue, New York, NY 10021, USA.
Bioinformatics. 2007 Mar 15;23(6):657-63. doi: 10.1093/bioinformatics/btl646. Epub 2007 Jan 18.
Array CGH technologies enable the simultaneous measurement of DNA copy number for thousands of sites on a genome. We developed the circular binary segmentation (CBS) algorithm to divide the genome into regions of equal copy number. The algorithm tests for change-points using a maximal t-statistic with a permutation reference distribution to obtain the corresponding P-value. The number of computations required for the maximal test statistic is O(N^2), where N is the number of markers. This makes the full permutation approach computationally prohibitive for the newer arrays that contain tens of thousands of markers and highlights the need for a faster algorithm.
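To make the quadratic cost concrete, the following is a minimal sketch of a maximal two-sample t-statistic over all circular arcs, with a full permutation P-value. It is an illustration under stated assumptions, not the DNAcopy implementation: the function names `max_circular_t` and `permutation_p_value`, the pooled-variance form of the t-statistic, the `(exceed + 1) / (n_perm + 1)` P-value convention, and the toy data are all choices made here for the example.

```python
import math
import random


def max_circular_t(x):
    """Return (max |t|, (i, j)) over arcs x[i:j] versus the rest of the circle.

    Every pair 0 <= i < j <= N is visited, so the cost is O(N^2).
    """
    n = len(x)
    prefix = [0.0]
    for v in x:
        prefix.append(prefix[-1] + v)
    total = prefix[n]
    total_sq = sum(v * v for v in x)

    best, best_arc = 0.0, None
    for i in range(n):
        for j in range(i + 1, n + 1):
            n1 = j - i          # markers inside the arc
            n2 = n - n1         # markers in the complement
            if n2 == 0:
                continue        # the full circle has no complement
            s1 = prefix[j] - prefix[i]
            s2 = total - s1
            diff = s1 / n1 - s2 / n2
            # Pooled within-group sum of squares; floored to avoid division by zero
            # (an assumption for degenerate, noise-free input).
            within = max(total_sq - s1 * s1 / n1 - s2 * s2 / n2, 1e-12)
            pooled_var = within / (n - 2)
            t = abs(diff) / math.sqrt(pooled_var * (1.0 / n1 + 1.0 / n2))
            if t > best:
                best, best_arc = t, (i, j)
    return best, best_arc


def permutation_p_value(x, n_perm=200, seed=0):
    """Full permutation reference distribution: O(n_perm * N^2) work in total."""
    rng = random.Random(seed)
    observed, arc = max_circular_t(x)
    shuffled = list(x)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        stat, _arc = max_circular_t(shuffled)
        if stat >= observed:
            exceed += 1
    return observed, arc, (exceed + 1) / (n_perm + 1)


# Toy log-ratio profile (hypothetical data): a gained segment at indices 8..12.
data = [0.05, -0.10, 0.02, 0.08, -0.04, 0.00, -0.07, 0.03,
        1.02, 0.95, 1.10, 0.98, 1.05,
        -0.02, 0.06, -0.09, 0.01, 0.04, -0.05, 0.07]
stat, arc, p = permutation_p_value(data)
print(arc, round(p, 4))
```

Each call to `max_circular_t` visits every arc once, and the permutation loop repeats that work `n_perm` times, which is why the full permutation approach scales poorly to arrays with tens of thousands of markers.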
We present a hybrid approach to obtain the P-value of the test statistic in linear time. We also introduce a rule for stopping early when there is strong evidence for the presence of a change. We show through simulations that the hybrid approach provides a substantial gain in speed with only a negligible loss in accuracy and that the stopping rule further increases speed. We also present the analyses of array CGH data from breast cancer cell lines to show the impact of the new approaches on the analysis of real data.
An R version of the CBS algorithm has been implemented in the "DNAcopy" package of the Bioconductor project. The proposed hybrid method for the P-value is available in version 1.2.1 or higher and the stopping rule for declaring a change early is available in version 1.5.1 or higher.