Snipen Lars, Repsilber Dirk, Nyquist Ludvig, Ziegler Andreas, Aakra Agot, Aastveit Are
Biostatistics, Department of Chemistry, Biotechnology and Food Sciences, Norwegian University of Life Sciences, N-1432 Ås, Norway.
BMC Bioinformatics. 2006 Mar 30;7:181. doi: 10.1186/1471-2105-7-181.
Array-based comparative genome hybridization (aCGH) is a tool for rapid comparison of genomes from different bacterial strains. The purpose of such an analysis is to detect genes that are highly divergent or absent in a sample strain compared to an index strain. Development of methods for analyzing aCGH data has primarily focused on copy number aberrations in cancer research. In microbial aCGH analyses, genes are typically ranked by log-ratio and classified as divergent or present by choosing a cutoff log-ratio, either manually or from statistics calculated on the log-ratio distribution. Because experimental settings vary considerably, it is not possible to develop a classical discriminant or statistical learning approach.
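The standard cutoff-based classification described above can be sketched roughly as follows. This is an illustration only, not code from the paper: the function name `classify_by_cutoff`, the sign convention (log-ratio of sample over index signal, so divergent genes have low values), and the data-driven rule (median minus a multiple of a MAD-based standard deviation) are all assumptions.

```python
import numpy as np

def classify_by_cutoff(log_ratios, cutoff=None, n_sd=2.0):
    """Flag genes as divergent when their log-ratio falls below a cutoff.

    Assumes log-ratio = log(sample / index), so divergent or absent genes
    have low values. If no cutoff is given, one is derived from the
    log-ratio distribution (hypothetical rule: median - n_sd * robust SD).
    """
    m = np.asarray(log_ratios, dtype=float)
    if cutoff is None:
        med = np.median(m)
        # 1.4826 * MAD estimates the SD for normally distributed data
        robust_sd = 1.4826 * np.median(np.abs(m - med))
        cutoff = med - n_sd * robust_sd
    return m < cutoff, cutoff
```

Calling `classify_by_cutoff(values, cutoff=-1.0)` applies a manual cutoff, while omitting `cutoff` derives one from the data.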
We introduce a more efficient method for analyzing microbial aCGH data using a finite mixture model and a data rotation scheme. Averaging the posterior probabilities from the model fitted to the log-ratios before and after rotation gives a score for each gene, and we demonstrate that this score ranks and detects divergent genes with improved specificity and sensitivity.
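The idea of scoring genes by averaged posterior probabilities can be illustrated with a minimal sketch. The abstract does not specify the mixture components or the rotation, so everything below is an assumption: a two-component Gaussian mixture fitted by EM, and a hypothetical rotation of the (average intensity, log-ratio) plane by the angle of a least-squares trend line. The names `divergent_posterior`, `rotate_log_ratios`, and `divergence_score` are invented for this example.

```python
import numpy as np

def normal_pdf(x, mu, sd):
    return np.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * np.sqrt(2.0 * np.pi))

def divergent_posterior(x, n_iter=200):
    """Fit a two-component Gaussian mixture by EM and return, per gene,
    the posterior probability of the lower-mean (divergent) component."""
    x = np.asarray(x, dtype=float)
    mu = np.array([np.percentile(x, 5), np.percentile(x, 60)])
    sd = np.full(2, x.std() + 1e-6)
    w = np.array([0.2, 0.8])

    def responsibilities():
        dens = np.stack([w[k] * normal_pdf(x, mu[k], sd[k]) for k in range(2)])
        return dens / (dens.sum(axis=0) + 1e-300)

    for _ in range(n_iter):
        resp = responsibilities()                      # E-step
        nk = resp.sum(axis=1) + 1e-12                  # M-step
        w = nk / x.size
        mu = (resp * x).sum(axis=1) / nk
        sd = np.sqrt((resp * (x - mu[:, None]) ** 2).sum(axis=1) / nk) + 1e-6
    return responsibilities()[np.argmin(mu)]

def rotate_log_ratios(a, m):
    """Hypothetical rotation: remove a linear intensity trend by rotating
    the centered (a, m) points by the angle of the least-squares line."""
    a = np.asarray(a, dtype=float)
    m = np.asarray(m, dtype=float)
    theta = np.arctan(np.polyfit(a, m, 1)[0])
    return -np.sin(theta) * (a - a.mean()) + np.cos(theta) * (m - m.mean())

def divergence_score(a, m):
    """Average the divergent-component posteriors before and after rotation."""
    return 0.5 * (divergent_posterior(m) + divergent_posterior(rotate_log_ratios(a, m)))
```

Here `a` stands for the average log-intensity of each gene and `m` for its log-ratio; genes with a score near 1 would be ranked as most likely divergent.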
The procedure is tested and compared to other approaches on simulated data sets, as well as on four experimental validation data sets for aCGH analysis on fully sequenced strains of Staphylococcus aureus and Streptococcus pneumoniae.
When tested on simulated data and on four experimental validation data sets from experiments with only fully sequenced strains, our procedure outperforms the standard procedure of classifying genes as present or divergent with a simple log-ratio cutoff.