Institute of Physics, Federal University of Bahia, Campus Universitário de Ondina, Salvador, Bahia, Brazil.
PLoS Comput Biol. 2011 May;7(5):e1001131. doi: 10.1371/journal.pcbi.1001131. Epub 2011 May 5.
This paper proposes a new method to identify communities in generally weighted complex networks and applies it to phylogenetic analysis. In this case, the weights correspond to similarity indices among protein sequences. These indices are used to construct networks whose structure can then be analyzed to recover phylogenetically useful information. The analyses discussed here are mainly based on the modular character of protein similarity networks, explored through the Newman-Girvan algorithm with the help of the neighborhood matrix. The most relevant networks are found where the network topology changes abruptly, revealing distinct modules related to the sets of organisms to which the proteins belong. Sound biological information can be retrieved by the computational routines used in the network approach, without biological assumptions other than those incorporated by BLAST. Usually, all the main bacterial phyla and, in some cases, some bacterial classes corresponded totally (100%) or to a great extent (>70%) to the modules. We checked the results for internal consistency, and comparisons between them gave close to 84% agreement in community membership. To illustrate how to use the network-based method, we employed data for enzymes involved in the chitin metabolic pathway that are present in more than 100 organisms, taken from an original data set of 1,695 organisms downloaded from GenBank on May 19, 2007. A preliminary comparison between the outcomes of the network-based method and the results of methods based on Bayesian, distance, likelihood, and parsimony criteria suggests that the former is as reliable as these commonly used methods. We conclude that the network-based method can be used as a powerful tool for retrieving modularity information from weighted networks, which is useful for phylogenetic analysis.
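The general idea of the abstract, thresholding a weighted similarity matrix into a network and then splitting it with Newman-Girvan edge removal, can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the toy similarity matrix `sim`, the threshold value `0.5`, and the helper names `threshold_graph`, `components`, `edge_betweenness`, and `girvan_newman_split` are all assumptions made for illustration. The neighborhood matrix and the search for abrupt topology changes described in the paper are not reproduced here.

```python
from collections import deque

def threshold_graph(sim, sigma):
    """Link sequences i and j when their similarity is at least sigma."""
    n = len(sim)
    adj = {i: set() for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            if sim[i][j] >= sigma:
                adj[i].add(j)
                adj[j].add(i)
    return adj

def components(adj):
    """Connected components found by breadth-first search."""
    seen, comps = set(), []
    for s in adj:
        if s in seen:
            continue
        comp, queue = {s}, deque([s])
        seen.add(s)
        while queue:
            v = queue.popleft()
            for w in adj[v]:
                if w not in seen:
                    seen.add(w)
                    comp.add(w)
                    queue.append(w)
        comps.append(comp)
    return comps

def edge_betweenness(adj):
    """Shortest-path edge betweenness (Brandes-style) for an unweighted graph."""
    bet = {}
    for s in adj:
        dist = {s: 0}
        paths = {v: 0.0 for v in adj}  # number of shortest paths from s
        paths[s] = 1.0
        preds = {v: [] for v in adj}
        order, queue = [], deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    paths[w] += paths[v]
                    preds[w].append(v)
        # Accumulate dependencies from the farthest nodes back to s.
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                c = paths[v] / paths[w] * (1.0 + delta[w])
                edge = (min(v, w), max(v, w))
                bet[edge] = bet.get(edge, 0.0) + c
                delta[v] += c
    return bet

def girvan_newman_split(adj):
    """Remove highest-betweenness edges until one more component appears."""
    adj = {v: set(nb) for v, nb in adj.items()}  # work on a copy
    start = len(components(adj))
    while len(components(adj)) == start:
        bet = edge_betweenness(adj)
        if not bet:  # no edges left to remove
            break
        u, v = max(bet, key=bet.get)
        adj[u].discard(v)
        adj[v].discard(u)
    return components(adj)

# Toy data (hypothetical): two groups of three sequences with high
# within-group similarity and a single weaker link between the groups.
sim = [[0.1] * 6 for _ in range(6)]
for group in ([0, 1, 2], [3, 4, 5]):
    for i in group:
        for j in group:
            sim[i][j] = 0.9
sim[2][3] = sim[3][2] = 0.6

adj = threshold_graph(sim, 0.5)
communities = girvan_newman_split(adj)
print(communities)
```

In this toy case the single inter-group link carries every shortest path between the two groups, so it has the highest betweenness and is removed first. With real BLAST-derived similarities one would presumably repeat the analysis over a range of thresholds rather than fix a single value.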