Nikolski Macha, Sherman David J
CNRS/LaBRI, Université Bordeaux 1, 351 cours de la Libération, 33405 Talence Cedex, France.
Bioinformatics. 2007 Jan 15;23(2):e71-6. doi: 10.1093/bioinformatics/btl314.
Reliable identification of protein families is key to phylogenetic analysis, functional annotation and the exploration of protein function diversity in a given phylogenetic branch. As more and more complete genomes are sequenced, there is a need for powerful and reliable algorithms that facilitate the construction of protein families.
We have formulated the problem of protein family construction as an instance of consensus clustering, for which we designed a novel algorithm that is computationally efficient in practice and produces high-quality results. Our algorithm uses an election method to construct consensus families from competing clustering computations. It is tailored to the specific needs of comparative genomics projects. First, it provides a robust means of incorporating results from different and complementary clustering methods, which avoids an a priori choice of method that may introduce computational bias in the results. Second, it is suited to large-scale projects because it is efficient in practice. Third, it produces high-quality results in which families tend to represent groupings by biological function.
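The abstract does not spell out the election method, so the following is only a minimal sketch of the general idea of building consensus families by voting, not the authors' algorithm. It assumes a simple pairwise majority rule: two proteins are joined when more than half of the input clusterings place them in the same cluster, and consensus families are the connected components of the resulting graph. The function name `consensus_families` and the protein identifiers are hypothetical.

```python
from collections import Counter
from itertools import combinations


def consensus_families(clusterings):
    """Majority-vote consensus over several clusterings (illustrative sketch).

    clusterings: list of partitions; each partition is a list of sets of
    protein identifiers. Returns a sorted list of sorted families.
    """
    votes = Counter()   # votes[(a, b)] = number of clusterings grouping a with b
    proteins = set()
    for partition in clusterings:
        for cluster in partition:
            proteins.update(cluster)
            # Sorting gives each unordered pair one canonical key.
            for a, b in combinations(sorted(cluster), 2):
                votes[(a, b)] += 1

    # Assumed rule: a pair is "elected" by a strict majority of clusterings.
    threshold = len(clusterings) / 2

    # Union-find to collect connected components of elected pairs.
    parent = {p: p for p in proteins}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for (a, b), count in votes.items():
        if count > threshold:
            parent[find(a)] = find(b)

    families = {}
    for p in proteins:
        families.setdefault(find(p), set()).add(p)
    return sorted(sorted(family) for family in families.values())


if __name__ == "__main__":
    # Three competing clusterings of five hypothetical proteins.
    c1 = [{"p1", "p2", "p3"}, {"p4", "p5"}]
    c2 = [{"p1", "p2"}, {"p3", "p4", "p5"}]
    c3 = [{"p1", "p2", "p3"}, {"p4"}, {"p5"}]
    print(consensus_families([c1, c2, c3]))
```

Taking connected components keeps the sketch short, but it can chain weakly related proteins into one family; a real election scheme would need a rule for resolving such conflicts.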
This method has been used in the Génolevures project to compute protein families of Hemiascomycetous yeasts. The data are available online at http://cbi.labri.fr/Genolevures/fam/