Derelle Romain, Philippe Hervé, Colbourne John K
School of Biosciences, University of Birmingham, Birmingham, United Kingdom.
Station d'Ecologie Théorique et Expérimentale, UMR CNRS 5321, Moulis, France.
Mol Biol Evol. 2020 Nov 1;37(11):3389-3396. doi: 10.1093/molbev/msaa159.
Orthology assignment is a key step of comparative genomic studies, for which many bioinformatic tools have been developed. However, all gene clustering pipelines are based on the analysis of protein distances, which are subject to many artifacts. In this article, we introduce Broccoli, a user-friendly pipeline designed to infer, with high precision, orthologous groups and orthologous pairs of proteins using a phylogeny-based approach. Briefly, Broccoli performs ultrafast phylogenetic analyses on most proteins and builds a network of orthologous relationships. Orthologous groups are then identified from the network using a parameter-free machine learning algorithm. Broccoli is also able to detect chimeric proteins resulting from gene-fusion events and to assign these proteins to the corresponding orthologous groups. Tested on two benchmark data sets, Broccoli outperforms current orthology pipelines. In addition, Broccoli is scalable, with runtimes similar to those of recent distance-based pipelines. Given its high level of performance and efficiency, this new pipeline represents a suitable choice for comparative genomic studies. Broccoli is freely available at https://github.com/rderelle/Broccoli.
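The step in which orthologous groups are identified from a network can be illustrated with a small sketch. The abstract does not name the algorithm, so the example below uses label propagation as one example of a community-detection method that needs no tuning parameters. It is not taken from the Broccoli code base. The function name `label_propagation`, the protein names, and the edge weights are all hypothetical, and `max_iter` is only a safety cap on the loop, not a parameter of the method. To keep the output reproducible, nodes are visited in sorted order and ties are broken deterministically, whereas typical implementations randomize both.

```python
from collections import defaultdict


def label_propagation(edges, max_iter=100):
    """Group the nodes of a weighted undirected graph by label propagation.

    edges: iterable of (node_a, node_b, weight) tuples.
    Returns a sorted list of groups, each a sorted list of node names.
    """
    # Build an adjacency map: node -> {neighbor: weight}.
    neighbors = defaultdict(dict)
    for a, b, weight in edges:
        neighbors[a][b] = weight
        neighbors[b][a] = weight

    # Every node starts in its own group, labeled with its own name.
    labels = {node: node for node in neighbors}

    for _ in range(max_iter):  # safety cap only (hypothetical choice)
        changed = False
        for node in sorted(neighbors):
            # Sum the edge weights pointing to each neighboring label.
            scores = defaultdict(float)
            for other, weight in neighbors[node].items():
                scores[labels[other]] += weight
            best_score = max(scores.values())
            # Keep the current label if it is already among the best.
            if scores.get(labels[node], 0.0) == best_score:
                continue
            # Otherwise adopt the best label; ties go to the smallest name.
            labels[node] = min(l for l, s in scores.items() if s == best_score)
            changed = True
        if not changed:
            break

    groups = defaultdict(set)
    for node, label in labels.items():
        groups[label].add(node)
    return sorted(sorted(group) for group in groups.values())


# Hypothetical toy network: two gene families joined by one weak edge.
edges = [
    ("human_A", "mouse_A", 1.0),
    ("human_A", "fly_A", 1.0),
    ("mouse_A", "fly_A", 1.0),
    ("human_B", "mouse_B", 1.0),
    ("human_B", "fly_B", 1.0),
    ("mouse_B", "fly_B", 1.0),
    ("fly_A", "human_B", 0.1),
]
groups = label_propagation(edges)
print(groups)
```

On this toy network the weak edge between `fly_A` and `human_B` is outweighed by the stronger within-family edges, so the two families are recovered as separate groups.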