Bioinformatics Group, Wageningen University, Wageningen, The Netherlands.
Biosystematics Group, Wageningen University, Wageningen, The Netherlands.
BMC Bioinformatics. 2018 Sep 26;19(1):340. doi: 10.1186/s12859-018-2362-4.
Identification of homologous genes is fundamental to comparative genomics, functional genomics and phylogenomics. Extensive public homology databases are of great value for investigating homology but need to be continually updated to incorporate new sequences. As new sequences are rapidly being generated, there is a need for efficient standalone tools to detect homologs in novel data.
To address this, we present a fast method for detecting homology groups across a large number of individuals and/or species. We adopted a k-mer based approach which considerably reduces the number of pairwise protein alignments without sacrificing sensitivity. We demonstrate accuracy, scalability, efficiency and applicability of the presented method for detecting homology in large proteomes of bacteria, fungi, plants and Metazoa.
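As a rough illustration of how a k-mer index can reduce the number of pairwise alignments, the sketch below only proposes protein pairs that share a minimum number of k-mers. It is not the PanTools implementation. The function names, `k=3`, and the `min_shared` threshold are hypothetical choices made for this example.

```python
from collections import defaultdict
from itertools import combinations


def kmers(seq, k):
    """Return the set of distinct k-mers in a protein sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def candidate_pairs(proteins, k=3, min_shared=2):
    """Return protein pairs sharing at least `min_shared` distinct k-mers.

    `proteins` maps a name to its amino-acid sequence. Only the returned
    pairs would be passed on to a (more expensive) pairwise alignment step.
    The values of k and min_shared are illustrative assumptions.
    """
    # Inverted index: k-mer -> names of the proteins that contain it.
    index = defaultdict(set)
    for name, seq in proteins.items():
        for kmer in kmers(seq, k):
            index[kmer].add(name)

    # Count shared k-mers only for proteins that co-occur under some k-mer.
    shared = defaultdict(int)
    for names in index.values():
        for a, b in combinations(sorted(names), 2):
            shared[(a, b)] += 1

    return {pair for pair, count in shared.items() if count >= min_shared}


# Toy sequences invented for this example.
proteins = {
    "p1": "MKVLAAGIVG",
    "p2": "MKVLSAGIVG",
    "p3": "QWERTYPHDN",
}
print(candidate_pairs(proteins))
```

Here only the pair (`p1`, `p2`) is proposed for alignment; `p3` shares no 3-mer with the other two sequences and is never compared, so one alignment is run instead of three.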
We observed a clear trade-off between recall and precision in our homology inference. Whether to favor recall or precision depends strongly on the application. The clustering behavior of our program can be tuned for a particular application by altering a few key parameters. The program is publicly available at https://github.com/sheikhizadeh/pantools as an extension to our pan-genomic analysis tool, PanTools.
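The recall and precision trade-off can be made concrete with a pairwise measure over homology groups. The sketch below assumes that a pair of proteins counts as a true positive when both the predicted and the reference grouping place them in the same group. This is one common way to score clusterings and is not necessarily the measure used by the authors. The function names and toy groups are hypothetical.

```python
from itertools import combinations


def pairs_in(groups):
    """Return every unordered pair of members that share a group."""
    return {frozenset(p) for g in groups for p in combinations(sorted(g), 2)}


def precision_recall(predicted, reference):
    """Pairwise precision and recall of predicted groups against a reference."""
    pred, ref = pairs_in(predicted), pairs_in(reference)
    true_pos = len(pred & ref)
    # By convention, an empty set of pairs yields a score of 1.0.
    precision = true_pos / len(pred) if pred else 1.0
    recall = true_pos / len(ref) if ref else 1.0
    return precision, recall


# Toy groupings invented for this example.
reference = [{"a", "b", "c"}, {"d", "e"}]
loose = [{"a", "b", "c", "d", "e"}]       # merges everything into one group
strict = [{"a", "b"}, {"c"}, {"d", "e"}]  # splits one true group

print(precision_recall(loose, reference))   # (0.4, 1.0)
print(precision_recall(strict, reference))  # (1.0, 0.5)
```

A permissive clustering recovers every true pair at the cost of false ones, while a strict clustering does the opposite. This is the behavior that parameter tuning trades off.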