F1ALA: ultrafast and memory-efficient ancestral lineage annotation applied to the huge SARS-CoV-2 phylogeny.

Author information

Ye Yongtao, Shum Marcus H, Wu Isaac, Chau Carlos, Zhao Ningqi, Smith David K, Wu Joseph T, Lam Tommy T

Affiliations

State Key Laboratory of Emerging Infectious Diseases, School of Public Health, The University of Hong Kong, Hong Kong SAR, P. R. China.

Laboratory of Data Discovery for Health, 19W Hong Kong Science & Technology Parks, Hong Kong SAR, P. R. China.

Publication information

Virus Evol. 2024 Jul 25;10(1):veae056. doi: 10.1093/ve/veae056. eCollection 2024.

DOI:10.1093/ve/veae056

PMID:39247558

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC11378316/

Abstract

The unprecedentedly large size of the global SARS-CoV-2 phylogeny makes any computation on the tree difficult. Lineage identification (e.g. the PANGO nomenclature for SARS-CoV-2) and assignment are key to tracking the evolution of the virus. This requires annotating unlabeled ancestral nodes in a phylogenetic tree as the clade roots of lineages, so that the descendant samples under each clade root can be inferred to belong to the corresponding lineage. This is the ancestral lineage annotation problem, for which matUtils (a package in pUShER) and PastML are commonly used methods. However, their computational tractability is a challenge and their accuracy needs further exploration in huge SARS-CoV-2 phylogenies. We have developed an efficient and accurate method, called "F1ALA", that uses the F1-score to evaluate the confidence with which a specific ancestral node can be annotated as the clade root of a lineage, given the lineage labels of a set of taxa in a rooted tree. Compared to these methods, F1ALA was roughly an order of magnitude faster while using ∼12% of their memory when annotating 2277 PANGO lineages in a phylogeny of 5.26 million taxa. F1ALA allows real-time lineage tracking to be performed on a laptop computer. F1ALA outperformed matUtils (pUShER) with statistical significance, and had comparable accuracy to PastML in tests on empirical and simulated data. F1ALA also enables tree refinement: taxa whose labels are inconsistent with their closest annotation nodes are pruned and re-inserted into the pruned tree, yielding a SARS-CoV-2 phylogeny with both a higher log-likelihood and a lower parsimony score. Given its ultrafast speed and high accuracy, we anticipate that F1ALA will also be useful for large phylogenies of other viruses. Code and benchmark datasets are publicly available at https://github.com/id-bioinfo/F1ALA.
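
The F1-score criterion described in the abstract can be illustrated with a minimal sketch. This is not the F1ALA implementation; it assumes one plausible reading in which, for a candidate node and a lineage, precision is the fraction of labeled descendant tips that carry the lineage and recall is the fraction of all tips of that lineage that fall under the node. The function name `annotate_clade_roots`, the dictionary-based tree encoding, and the toy tree are hypothetical, and only internal nodes are scored.

```python
from collections import Counter

def annotate_clade_roots(children, labels, root):
    """Return {lineage: (best_node, best_f1)} for a rooted tree.

    children: dict mapping each internal node to a list of its child nodes
    labels:   dict mapping labeled tips to a lineage name
    Assumption: precision and recall are defined over labeled descendant tips.
    """
    totals = Counter(labels.values())   # labeled tips per lineage in the whole tree
    counts = {}                         # lineage counts below each processed node
    best = {}
    stack = [(root, False)]             # iterative post-order traversal
    while stack:
        node, expanded = stack.pop()
        kids = children.get(node, [])
        if not kids:                    # tip: contributes its own label, if any
            counts[node] = Counter({labels[node]: 1}) if node in labels else Counter()
            continue
        if not expanded:                # revisit this node after its children
            stack.append((node, True))
            stack.extend((kid, False) for kid in kids)
            continue
        below = Counter()
        for kid in kids:
            below += counts.pop(kid)    # free child counts once merged
        counts[node] = below
        n_labeled = sum(below.values())
        for lineage, tp in below.items():
            precision = tp / n_labeled
            recall = tp / totals[lineage]
            f1 = 2 * precision * recall / (precision + recall)
            if f1 > best.get(lineage, (None, 0.0))[1]:
                best[lineage] = (node, f1)
    return best

# Toy tree: tip "e" carries lineage A but sits inside the clade of lineage B.
children = {"root": ["n1", "n2"], "n1": ["a", "b"], "n2": ["c", "d", "e"]}
labels = {"a": "A", "b": "A", "c": "B", "d": "B", "e": "A"}
best = annotate_clade_roots(children, labels, "root")
for lineage, (node, f1) in sorted(best.items()):
    print(lineage, node, round(f1, 3))
```

In this toy case the misplaced tip `e` lowers the score of every candidate node without changing which node scores highest for each lineage. Under the abstract's description, such a tip would be a candidate for the pruning and re-insertion step.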

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/89b6/11378316/58ee4bce9de9/veae056f1.jpg
