Hsu Fang-Han, Chen Hung-I H, Tsai Mong-Hsun, Lai Liang-Chuan, Huang Chi-Cheng, Tu Shih-Hsin, Chuang Eric Y, Chen Yidong
Graduate Institute of Biomedical Electronics and Bioinformatics, Department of Electrical Engineering, National Taiwan University, Taipei 106, Taiwan.
BMC Res Notes. 2011 Oct 10;4:394. doi: 10.1186/1756-0500-4-394.
Circular Binary Segmentation (CBS) is a permutation-based algorithm for array Comparative Genomic Hybridization (aCGH) data analysis. CBS accurately segments data by detecting change-points using a maximal-t test; but extensive computational burden is involved for evaluating the significance of change-points using permutations. A recent implementation utilizing a hybrid method and early stopping rules (hybrid CBS) to improve the performance in speed was subsequently proposed. However, a time analysis revealed that a major portion of computation time of the hybrid CBS was still spent on permutation. In addition, what the hybrid method provides is an approximation of the significance upper bound or lower bound, not an approximation of the significance of change-points itself.
We developed a novel model-based algorithm, extreme-value based CBS (eCBS), which limits permutations and provides robust results without loss of accuracy. Thousands of aCGH data under null hypothesis were simulated in advance based on a variety of non-normal assumptions, and the corresponding maximal-t distribution was modeled by the Generalized Extreme Value (GEV) distribution. The modeling results, which associate characteristics of aCGH data to the GEV parameters, constitute lookup tables (eXtreme model). Using the eXtreme model, the significance of change-points could be evaluated in a constant time complexity through a table lookup process.
A novel algorithm, eCBS, was developed in this study. The current implementation of eCBS consistently outperforms the hybrid CBS 4× to 20× in computation time without loss of accuracy. Source codes, supplementary materials, supplementary figures, and supplementary tables can be found at http://ntumaps.cgm.ntu.edu.tw/eCBSsupplementary.
循环二元分割(CBS)是一种用于阵列比较基因组杂交(aCGH)数据分析的基于排列的算法。CBS通过使用最大t检验检测变化点来准确分割数据;但使用排列评估变化点的显著性涉及大量计算负担。最近提出了一种利用混合方法和早期停止规则的实现方式(混合CBS)来提高速度性能。然而,时间分析表明,混合CBS的大部分计算时间仍花费在排列上。此外,混合方法提供的是显著性上限或下限的近似值,而不是变化点本身显著性的近似值。
我们开发了一种基于模型的新算法,基于极值的CBS(eCBS),它限制排列并提供稳健结果且不损失准确性。基于各种非正态假设预先模拟了数千个零假设下的aCGH数据,并通过广义极值(GEV)分布对相应的最大t分布进行建模。将aCGH数据特征与GEV参数相关联的建模结果构成查找表(极值模型)。使用该极值模型,通过查表过程可以在恒定时间复杂度内评估变化点的显著性。
本研究开发了一种新算法eCBS。当前eCBS的实现方式在计算时间上始终比混合CBS快4倍到20倍,且不损失准确性。源代码、补充材料、补充图和补充表可在http://ntumaps.cgm.ntu.edu.tw/eCBSsupplementary上找到。