Anderson Bradley D, Gilson Michael C, Scott Abigail A, Biehl Bryan S, Glasner Jeremy D, Rajashekara Gireesh, Splitter Gary A, Perna Nicole T
Animal Health and Biomedical Sciences, University of Wisconsin, Madison, WI 53706, USA.
BMC Genomics. 2006 Apr 25;7:91. doi: 10.1186/1471-2164-7-91.
Comparative genomic hybridization can rapidly identify chromosomal regions that vary between organisms and tissues. This technique has been applied to detecting differences between normal and cancerous tissues in eukaryotes as well as genomic variability in microbial strains and species. The density of oligonucleotide probes available on current microarray platforms is particularly well suited to comparisons of organisms with smaller genomes, such as bacteria and yeast, where an entire genome can be assayed on a single microarray at high resolution. Available methods for analyzing these experiments typically confine analyses to data from pre-defined annotated genome features, such as entire genes. Many of these methods are ill-suited to datasets containing the number of measurements typical of high-density microarrays.
We present an algorithm for analyzing microarray hybridization data to aid identification of regions that vary between an unsequenced genome and a sequenced reference genome. The program, CGHScan, uses an iterative random walk approach integrating multi-layered significance testing to detect these regions from comparative genomic hybridization data. The algorithm tolerates a high level of noise in measurements of individual probe intensities and is relatively insensitive to the choice of method for normalizing probe intensity values and identifying probes that differ between samples. When applied to comparative genomic hybridization data from a published experiment, CGHScan identified eight of nine known deletions in a Brucella ovis strain as compared to Brucella melitensis. The same result was obtained using two different normalization methods and two different scores to classify data for individual probes as representing conserved or variable genomic regions. The undetected region is a small (58 base pair) deletion that is below the resolution of CGHScan given the array design employed in the study.
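The general idea of scanning ordered probe calls for statistically supported variable regions can be illustrated with a minimal sketch. This is not the CGHScan implementation: the abstract does not specify its scoring or significance tests, so the code below assumes a simple cumulative-score walk (variable probes add `gain`, conserved probes subtract `penalty`), a permutation test as a stand-in for the significance testing, and an iterative mask-and-rescan loop. All function names, parameters, and default values are hypothetical.

```python
import random


def best_segment(scores):
    """Return (start, end, total) of the highest-scoring contiguous run.

    `end` is exclusive. This is the classic maximum-subarray scan: the
    running total is restarted whenever it falls to zero or below.
    """
    best = (0, 0, 0.0)
    run_start, run_total = 0, 0.0
    for i, s in enumerate(scores):
        if run_total <= 0:
            run_start, run_total = i, 0.0
        run_total += s
        if run_total > best[2]:
            best = (run_start, i + 1, run_total)
    return best


def scan_variable_regions(calls, gain=1.0, penalty=2.0,
                          alpha=0.01, n_perm=200, seed=0):
    """Iteratively report candidate variable regions (illustrative only).

    calls: per-probe classifications in genome order, truthy = variable.
    Returns a sorted list of (start, end, p_value) in probe indices.
    """
    rng = random.Random(seed)
    scores = [gain if c else -penalty for c in calls]
    regions = []
    while True:
        start, end, total = best_segment(scores)
        if total <= 0:
            break  # no variable probes left to explain
        # Assumed null model: the same scores in random genome order.
        shuffled = scores[:]
        exceed = 0
        for _ in range(n_perm):
            rng.shuffle(shuffled)
            if best_segment(shuffled)[2] >= total:
                exceed += 1
        p_value = (exceed + 1) / (n_perm + 1)
        if p_value > alpha:
            break  # best remaining run is consistent with noise
        regions.append((start, end, p_value))
        # Mask the accepted region so the next pass finds a different one.
        for i in range(start, end):
            scores[i] = -penalty
    return sorted(regions)


if __name__ == "__main__":
    # Toy data: one 30-probe variable region containing a single noisy
    # "conserved" call, plus three isolated false-positive probes.
    calls = [0] * 300
    for i in range(100, 130):
        calls[i] = 1
    calls[115] = 0
    for i in (10, 200, 250):
        calls[i] = 1
    for start, end, p in scan_variable_regions(calls):
        print(f"probes {start}-{end - 1}: p <= {p:.3f}")
```

In this toy setup the isolated variable calls are not reported because a single-probe run is no better than what shuffled data produces, while the long run survives despite one contradictory probe inside it. A real analysis would also need to map probe indices back to genome coordinates, which the sketch leaves out.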
CGHScan is an effective tool for analyzing comparative genomic hybridization data from high-density microarrays. The algorithm accurately identifies known variable regions and tolerates high noise and varying methods of data preprocessing. Each variable region is defined by statistical analysis, providing a robust and reliable method for rapid identification of genomic differences independent of annotated gene boundaries.