Department of Integrated Biosciences, Graduate School of Frontier Sciences, The University of Tokyo, Kashiwa, Japan.
Center of Excellence in Computational Molecular Biology, Faculty of Medicine, Chulalongkorn University, Bangkok, Thailand.
Genome Biol. 2024 Jul 25;25(1):195. doi: 10.1186/s13059-024-03298-4.
Accurate inference of orthologous genes constitutes a prerequisite for comparative and evolutionary genomics. SonicParanoid is one of the fastest tools for orthology inference; however, its scalability and accuracy have been hampered by time-consuming all-versus-all alignments and the existence of proteins with complex domain architectures. Here, we present a substantial update of SonicParanoid, where a gradient boosting predictor halves the execution time and a language model doubles the recall. Application to empirical large-scale and standardized benchmark datasets shows that SonicParanoid2 is much faster than comparable methods and also the most accurate. SonicParanoid2 is available at https://gitlab.com/salvo981/sonicparanoid2 and https://zenodo.org/doi/10.5281/zenodo.11371108 .