Wilfrid Laurier University, Department of Biology, 75 University Ave W, Waterloo, N2L 3C5, ON, Canada.
BMC Genomics. 2020 Oct 24;21(1):741. doi: 10.1186/s12864-020-07132-6.
Finding orthologs remains an important bottleneck in comparative genomics analyses. Authors of software for fast protein sequence comparison typically evaluate its speed and compare its results against the most commonly used software for the task, but they rarely evaluate it for more specific uses, such as finding orthologs as reciprocal best hits (RBH). Here we compared the RBH results obtained with three programs that run faster than blastp: lastal, diamond, and MMseqs2.
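The RBH criterion mentioned above can be illustrated with a minimal sketch. It assumes each search result has already been reduced to (query, subject, bit score) triples, that the best hit is simply the highest-scoring subject, and that ties go to the first hit seen. These choices, the function names, and the toy identifiers are illustrative assumptions, not the filtering used in the study.

```python
from typing import Dict, Iterable, List, Tuple

# Assumed record layout: (query id, subject id, bit score).
Hit = Tuple[str, str, float]


def best_hits(hits: Iterable[Hit]) -> Dict[str, str]:
    """Map each query to its highest-scoring subject (first hit wins ties)."""
    best: Dict[str, Tuple[str, float]] = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(
    a_vs_b: Iterable[Hit], b_vs_a: Iterable[Hit]
) -> List[Tuple[str, str]]:
    """Return pairs (a, b) where b is a's best hit and a is b's best hit."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


# Toy data with hypothetical protein identifiers.
a_vs_b = [
    ("a1", "b1", 250.0),
    ("a1", "b2", 90.0),
    ("a2", "b2", 180.0),
    ("a3", "b1", 120.0),
]
b_vs_a = [
    ("b1", "a1", 248.0),
    ("b2", "a2", 175.0),
    ("b2", "a1", 88.0),
]

rbh_pairs = reciprocal_best_hits(a_vs_b, b_vs_a)
print(rbh_pairs)  # [('a1', 'b1'), ('a2', 'b2')]
```

In this toy case a3 is left without an ortholog call: its best hit is b1, but b1's best hit is a1, so the relationship is not reciprocal.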
We found that lastal required the least time to produce results. However, it yielded fewer results than any other program when comparing the proteins encoded by evolutionarily distant genomes. The program producing the number of RBH most similar to blastp was diamond run with the "ultra-sensitive" option. However, this option was diamond's slowest, with the "very-sensitive" option offering the best balance between speed and RBH results. The speed-up was much more evident when dealing with eukaryotic genomes, which encode more proteins. For example, lastal took a median of approx. 1.5% of the blastp time to run with bacterial proteomes and 0.6% with eukaryotic ones, while diamond with the very-sensitive option took 7.4% and 5.2%, respectively. Though estimated error rates were very similar among the RBH obtained with all programs, RBH obtained with MMseqs2 had the lowest error rates among the programs tested.
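To put the reported percentages in more familiar terms, the short calculation below converts each median fraction of blastp run time into an approximate fold speed-up (1 divided by the fraction). The four fractions are the ones quoted above; the dictionary layout and variable names are illustrative, and the results are rough because the source gives the percentages only approximately.

```python
# Median run time as a fraction of blastp run time, as quoted in the abstract.
fraction_of_blastp_time = {
    ("lastal", "bacterial"): 0.015,
    ("lastal", "eukaryotic"): 0.006,
    ("diamond very-sensitive", "bacterial"): 0.074,
    ("diamond very-sensitive", "eukaryotic"): 0.052,
}

# Fold speed-up relative to blastp is the reciprocal of the time fraction.
speedups = {key: 1.0 / frac for key, frac in fraction_of_blastp_time.items()}

for (program, proteomes), fold in speedups.items():
    print(f"{program}, {proteomes} proteomes: about {fold:.0f}x faster than blastp")
```

Under this reading, lastal ran roughly 67 times faster than blastp on bacterial proteomes and roughly 167 times faster on eukaryotic ones, while diamond with the very-sensitive option ran roughly 14 and 19 times faster, respectively.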
The fast algorithms for pairwise protein comparison produced results very similar to those of blast in a fraction of the time, with diamond offering the best compromise between speed, sensitivity and quality, as long as a sensitivity option other than the default was chosen.