Max Planck Institute for Informatics, Saarland University, 66123 Saarbrücken, Germany.
Bioinformatics. 2013 Jan 15;29(2):215-22. doi: 10.1093/bioinformatics/bts653. Epub 2012 Nov 9.
Homology detection is a long-standing challenge in computational biology. To tackle this problem, all-versus-all BLAST results are typically coupled with data partitioning approaches, resulting in clusters of putatively homologous proteins. One of the main problems, however, has been widely neglected: all clustering tools need a density parameter that adjusts the number and size of the clusters. This parameter is crucial but hard to estimate without gold standard data at hand. Developing a gold standard, however, is a difficult and time-consuming task. A reliable method for detecting clusters of homologous proteins across a large set of species would open opportunities for better understanding the genetic repertoire of bacteria with different lifestyles.
Our main contribution is a method for identifying a suitable and robust density parameter for protein homology detection without a given gold standard. To this end, we study the core genome of 89 actinobacteria. This allows us to incorporate background knowledge, i.e. the assumption that a set of evolutionarily closely related species should share a comparably high number of evolutionarily conserved proteins (emerging from phylum-specific housekeeping genes). We apply our strategy to find genes/proteins that are specific to certain actinobacterial lifestyles, i.e. different types of pathogenicity. The whole study was performed with transitivity clustering, as it requires only a single intuitive density parameter and has been shown to be well suited to the task of protein sequence clustering. Note, however, that the presented strategy does not, in general, depend on our clustering method and can easily be adapted to other clustering approaches.
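The parameter selection idea described above can be illustrated with a minimal sketch. It is not the published implementation. It assumes that a "core" cluster is one containing at least one protein from every species, and that a suitable density parameter is one for which the number of core clusters is high. Transitivity clustering is replaced here by a simple connected-components clustering over similarity edges at or above the threshold, purely as a stand-in. All function names, protein identifiers, and similarity scores are hypothetical.

```python
from collections import defaultdict


def cluster_by_threshold(proteins, similarities, threshold):
    """Stand-in clustering (NOT transitivity clustering): connected
    components of the graph whose edges have similarity >= threshold."""
    parent = {p: p for p in proteins}

    def find(p):
        # Union-find lookup with path halving.
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    for (a, b), score in similarities.items():
        if score >= threshold:
            parent[find(a)] = find(b)

    clusters = defaultdict(set)
    for p in proteins:
        clusters[find(p)].add(p)
    return list(clusters.values())


def count_core_clusters(clusters, species_of, all_species):
    """Count clusters that contain a protein from every species
    (the assumed definition of a core cluster)."""
    return sum(
        1 for c in clusters
        if {species_of[p] for p in c} == all_species
    )


def scan_thresholds(proteins, similarities, species_of, thresholds):
    """Map each candidate density threshold to its core cluster count."""
    all_species = {species_of[p] for p in proteins}
    return {
        t: count_core_clusters(
            cluster_by_threshold(proteins, similarities, t),
            species_of,
            all_species,
        )
        for t in thresholds
    }


# Hypothetical toy input: three species (A, B, C), two gene families.
species_of = {"a1": "A", "b1": "B", "c1": "C",
              "a2": "A", "b2": "B", "c2": "C"}
proteins = list(species_of)
similarities = {
    ("a1", "b1"): 80, ("a1", "c1"): 80, ("b1", "c1"): 80,  # family 1
    ("a2", "b2"): 40, ("a2", "c2"): 40, ("b2", "c2"): 40,  # family 2
    ("a1", "a2"): 20,                                      # weak cross-family hit
}

counts = scan_thresholds(proteins, similarities, species_of, [10, 30, 60, 90])
best = max(counts, key=counts.get)  # threshold with the most core clusters
print(counts, best)
```

On this toy input the core cluster count peaks at threshold 30. A lower threshold merges both families into one cluster, and higher thresholds break the weaker family apart. A real analysis would also check how stable the count is around the chosen value, in line with the stated goal of a robust parameter.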
All results are publicly available at http://transclust.mmci.uni-saarland.de/actino_core/ or as Supplementary Material of this article.
Supplementary data are available at Bioinformatics online.