SonicParanoid2: fast, accurate, and comprehensive orthology inference with machine learning and language models.

Affiliations

Department of Integrated Biosciences, Graduate School of Frontier Sciences, the University of Tokyo, Kashiwa, Japan.

Center of Excellence in Computational Molecular Biology, Faculty of Medicine, Chulalongkorn University, Bangkok, Thailand.

Publication information

Genome Biol. 2024 Jul 25;25(1):195. doi: 10.1186/s13059-024-03298-4.

DOI:10.1186/s13059-024-03298-4

PMID:39054525

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11270883/

Abstract

Accurate inference of orthologous genes constitutes a prerequisite for comparative and evolutionary genomics. SonicParanoid is one of the fastest tools for orthology inference; however, its scalability and accuracy have been hampered by time-consuming all-versus-all alignments and the existence of proteins with complex domain architectures. Here, we present a substantial update of SonicParanoid, where a gradient boosting predictor halves the execution time and a language model doubles the recall. Application to empirical large-scale and standardized benchmark datasets shows that SonicParanoid2 is much faster than comparable methods and also the most accurate. SonicParanoid2 is available at https://gitlab.com/salvo981/sonicparanoid2 and https://zenodo.org/doi/10.5281/zenodo.11371108 .

Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/be5f/11270883/a09d21907aa3/13059_2024_3298_Fig1_HTML.jpg
