Department of Radiology, Harvard School of Public Health, Boston, MA 02115, USA.
Bioinformatics. 2011 Sep 1;27(17):2338-45. doi: 10.1093/bioinformatics/btr402. Epub 2011 Jul 12.
Human genomic variability occurs at different scales, from single nucleotide polymorphisms (SNPs) to large DNA segments. Copy number variations (CNVs) represent a significant part of our genetic heterogeneity and have also been associated with many diseases and disorders. Short, localized CNVs, which may play an important role in human disease, may be undetectable in noisy genomic data. Therefore, robust methodologies are needed for their detection. Furthermore, for meaningful identification of pathological CNVs, estimation of normal allelic aberrations is necessary.
We developed a signal processing-based methodology that denoises each sequence and then applies pattern matching, in order to increase the signal-to-noise ratio (SNR) of genomic data and improve CNV detection. We applied this signal-decomposition-matched filtering (SDMF) methodology to 429 normal genomic sequences and compared the detected CNVs to those in the Database of Genomic Variants. SDMF successfully detected a significant number of previously identified CNVs with frequencies of occurrence ≥10%, as well as unreported short CNVs. Its performance was also compared to that of circular binary segmentation (CBS) through simulations. SDMF had a significantly lower false detection rate and was significantly faster than CBS, an important advantage for handling the large datasets generated with high-resolution arrays. By focusing on improving SNR (instead of the robustness of the detection algorithm), SDMF is a very promising methodology for identifying CNVs at all genomic spatial scales.
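The denoise-then-match idea described above can be illustrated with a minimal sketch. This is not the authors' SDMF implementation: a sliding median stands in for the signal-decomposition step, the template is assumed to be a simple rectangle of known width, and all function names and the synthetic data are hypothetical.

```python
import random
from statistics import median


def moving_median(signal, window):
    """Denoise with a sliding median (a simple stand-in, NOT signal decomposition)."""
    half = window // 2
    n = len(signal)
    return [median(signal[max(0, i - half):min(n, i + half + 1)]) for i in range(n)]


def matched_filter(signal, template):
    """Correlate the signal with a unit-energy template.

    Output index i is the response with the template starting at position i.
    """
    norm = sum(t * t for t in template) ** 0.5
    m = len(template)
    return [
        sum(signal[i + j] * template[j] for j in range(m)) / norm
        for i in range(len(signal) - m + 1)
    ]


def detect_segment(signal, width, window=5):
    """Return (start, score) of the best match to a rectangular template."""
    smooth = moving_median(signal, window)
    response = matched_filter(smooth, [1.0] * width)
    # Use the absolute response so that both gains and losses can be found.
    start = max(range(len(response)), key=lambda i: abs(response[i]))
    return start, response[start]


# Synthetic log-ratio sequence (assumed values): Gaussian noise plus a short gain.
rng = random.Random(0)
truth_start, width = 120, 10
log_ratio = [rng.gauss(0.0, 0.2) for _ in range(300)]
for i in range(truth_start, truth_start + width):
    log_ratio[i] += 1.0

start, score = detect_segment(log_ratio, width)
print(f"estimated start={start}, filter response={score:.2f}")
```

In this sketch the denoising step raises SNR before a deliberately simple detector is applied, which mirrors the emphasis of the abstract; a real analysis would need a threshold on the filter response and a range of template widths to cover CNVs of different lengths.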
The data are available at http://tcga-data.nci.nih.gov/tcga/
The software and the list of analyzed sequence IDs are available at http://www.hsph.harvard.edu/~betensky/
Matlab code for Empirical Mode Decomposition may be found at http://www.clear.rice.edu/elec301/Projects02/empiricalMode/code.html