Department of Electrical and Computer Engineering, Aristotle University of Thessaloniki, Thessaloniki, Greece.
PLoS One. 2013;8(1):e52854. doi: 10.1371/journal.pone.0052854. Epub 2013 Jan 14.
Phylogenetic profiles express the presence or absence of genes and their homologs across a number of reference genomes. They have emerged as an elegant representation framework for comparative genomics and have been used for the genome-wide inference and discovery of functionally linked genes or metabolic pathways. As the number of reference genomes grows, there is an acute need for faster and more accurate methods for phylogenetic profile analysis. We propose a novel, efficient method for the detection of genomic idiosyncrasies, i.e., sets of genes found in a specific genome with peculiar phylogenetic properties, such as intra-genome correlations or inter-genome relationships. Our algorithm is a four-step process: genome profiles are first defined as fuzzy vectors, then discretized to binary vectors, then de-noised, and finally compared to generate intra- and inter-genome distances for each gene profile. The method is validated on a carefully selected benchmark set of five reference genomes, using a range of similarity metrics and pre-processing stages for noise reduction. We demonstrate that the fuzzy profile method consistently identifies the actual phylogenetic relationship and origin of the genes under consideration in the majority of cases, while the detected outliers are found to be genes with peculiar phylogenetic patterns. The proposed method provides a time-efficient and highly scalable approach for phylogenetic stratification: the detected groups of genes are either similar to their own genome profile or different from it, with the latter revealing atypical evolutionary histories.
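The four steps named in the abstract can be illustrated with a minimal Python sketch. The abstract does not specify the details, so everything concrete here is an assumption made for illustration: the fuzzy genome profile is taken as the per-reference-genome fraction of genes with a homolog, discretization uses a fixed threshold of 0.5, de-noising drops reference-genome positions whose fuzzy value lies within a margin of that threshold, and the comparison uses Hamming distance. All function names, parameters, and the toy data are hypothetical and are not taken from the paper.

```python
from typing import Dict, List, Tuple

Profile = List[int]  # 1 = homolog present in a reference genome, 0 = absent


def fuzzy_profile(gene_profiles: List[Profile]) -> List[float]:
    """Step 1 (assumed form): fraction of a genome's genes present per reference genome."""
    n = len(gene_profiles)
    return [sum(column) / n for column in zip(*gene_profiles)]


def discretize(fuzzy: List[float], threshold: float = 0.5) -> Profile:
    """Step 2: turn the fuzzy vector into a binary vector (threshold is an assumption)."""
    return [1 if value >= threshold else 0 for value in fuzzy]


def informative_columns(fuzzy_by_genome: Dict[str, List[float]],
                        threshold: float = 0.5,
                        margin: float = 0.1) -> List[int]:
    """Step 3 (assumed de-noising rule): keep only positions where no genome's
    fuzzy value is ambiguous, i.e. within `margin` of the threshold."""
    width = len(next(iter(fuzzy_by_genome.values())))
    return [i for i in range(width)
            if all(abs(f[i] - threshold) >= margin for f in fuzzy_by_genome.values())]


def hamming(a: Profile, b: Profile) -> int:
    """Number of positions at which two binary profiles differ."""
    return sum(x != y for x, y in zip(a, b))


def gene_distances(genomes: Dict[str, List[Profile]],
                   threshold: float = 0.5,
                   margin: float = 0.1) -> List[Tuple[str, int, int, int]]:
    """Step 4: for each gene, return (genome, gene index, intra-genome distance,
    smallest inter-genome distance)."""
    fuzzy = {name: fuzzy_profile(genes) for name, genes in genomes.items()}
    kept = informative_columns(fuzzy, threshold, margin)
    binary = {}
    for name, f in fuzzy.items():
        full = discretize(f, threshold)
        binary[name] = [full[i] for i in kept]
    results = []
    for name, genes in genomes.items():
        for idx, gene in enumerate(genes):
            masked = [gene[i] for i in kept]
            dists = {other: hamming(masked, prof) for other, prof in binary.items()}
            intra = dists[name]
            inter = min(d for other, d in dists.items() if other != name)
            results.append((name, idx, intra, inter))
    return results


# Toy data (hypothetical): two genomes, gene profiles over four reference genomes.
GENOMES = {
    "A": [[1, 1, 0, 0], [1, 1, 0, 0], [1, 1, 1, 0], [0, 0, 1, 1]],
    "B": [[0, 0, 1, 1], [0, 0, 1, 1], [0, 1, 1, 1]],
}

if __name__ == "__main__":
    for genome, idx, intra, inter in gene_distances(GENOMES):
        flag = "outlier" if intra > inter else "typical"
        print(genome, idx, intra, inter, flag)
```

In this toy setup, a gene is flagged when it sits closer to another genome's profile than to its own, which is one simple way to read the abstract's notion of genes "different from" their own genome profile. The paper evaluates a range of similarity metrics and noise-reduction stages, so the threshold, margin, and distance used here should be treated as placeholders.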