Department of Computer Science, Yonsei University, Seoul, South Korea.
PLoS One. 2011;6(10):e26975. doi: 10.1371/journal.pone.0026975. Epub 2011 Oct 31.
It is difficult to identify copy number variations (CNVs) in normal human genomic data because of noise and non-linear relationships between different genomic regions and signal intensity. A high-resolution array comparative genomic hybridization (aCGH) data set containing 42 million probes, far more than previous arrays, was recently published. Most existing CNV detection algorithms perform poorly on such data, both because of the noise associated with the large amount of input data and because most current methods were not designed to analyze normal human samples. Analysis of normal human genomes often requires a joint approach across multiple samples, yet the majority of existing methods can identify CNVs only from a single sample.
We developed a multi-sample-based genomic variations detector (MGVD) that combines segmentation, which identifies breakpoints common to multiple samples, with a k-means-based clustering strategy. Unlike previous methods, MGVD simultaneously considers multiple samples with different genomic intensities and identifies both CNVs and CNV zones (CNVZs); a CNVZ is a more precise measure of the location of a genomic variant than a CNV region (CNVR).
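As a rough illustration of the clustering idea only, the sketch below applies one-dimensional k-means to the per-sample mean log2 ratios of a single shared segment and labels each sample as loss, neutral, or gain. It is not MGVD's implementation: the function names (`kmeans_1d`, `call_segment_states`), the choice of three clusters, and the initial centers of -0.5, 0.0, and 0.5 are all assumptions made for the example, and segmentation into shared breakpoints is assumed to have been done already.

```python
from statistics import mean

# Hypothetical initial cluster centers (log2 ratio) for loss, neutral, gain.
# These values are illustrative assumptions, not taken from MGVD.
INITIAL_CENTERS = (-0.5, 0.0, 0.5)
STATE_NAMES = ("loss", "neutral", "gain")


def kmeans_1d(values, centers, max_iter=100):
    """Lloyd's algorithm in one dimension.

    Returns the final centers and, for each value, the index of the
    center it was assigned to.
    """
    centers = list(centers)
    labels = [0] * len(values)
    for _ in range(max_iter):
        # Assignment step: each value goes to its nearest center.
        labels = [
            min(range(len(centers)), key=lambda j: abs(v - centers[j]))
            for v in values
        ]
        # Update step: move each center to the mean of its members.
        # An empty cluster keeps its previous center.
        new_centers = []
        for j, old in enumerate(centers):
            members = [v for v, lab in zip(values, labels) if lab == j]
            new_centers.append(mean(members) if members else old)
        if new_centers == centers:
            break  # converged
        centers = new_centers
    return centers, labels


def call_segment_states(segment_means):
    """Label each sample's segment mean as 'loss', 'neutral', or 'gain'.

    segment_means holds one mean log2 ratio per sample for the same
    genomic segment (shared breakpoints across samples are assumed).
    """
    _, labels = kmeans_1d(segment_means, INITIAL_CENTERS)
    return [STATE_NAMES[lab] for lab in labels]
```

For example, `call_segment_states([-0.62, -0.55, 0.02, -0.03, 0.05, 0.48, 0.01])` labels the first two samples as losses, the sixth as a gain, and the rest as neutral. Clustering all samples together lets the group centers adapt to the intensity distribution of the cohort rather than relying on a fixed per-sample threshold.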
We designed a specialized algorithm to detect common CNVs from extremely high-resolution multi-sample aCGH data. MGVD showed high sensitivity and a low false discovery rate on a simulated data set, and outperformed most current methods on real, high-resolution HapMap data sets. MGVD also had the shortest runtime among the algorithms evaluated on actual high-resolution aCGH data. The CNVZs identified by MGVD can be used in association studies to reveal relationships between phenotypes and genomic aberrations. Our algorithm was implemented in standard C++ using the STL and is available for Linux and MS Windows. It is freely available at: http://embio.yonsei.ac.kr/~Park/mgvd.php.