Ye Yongtao, Shum Marcus H, Wu Isaac, Chau Carlos, Zhao Ningqi, Smith David K, Wu Joseph T, Lam Tommy T
State Key Laboratory of Emerging Infectious Diseases, School of Public Health, The University of Hong Kong, Hong Kong SAR, P. R. China.
Laboratory of Data Discovery for Health, 19W Hong Kong Science & Technology Parks, Hong Kong SAR, P. R. China.
Virus Evol. 2024 Jul 25;10(1):veae056. doi: 10.1093/ve/veae056. eCollection 2024.
The unprecedented size of the global SARS-CoV-2 phylogeny makes any computation on the tree difficult. Lineage identification (e.g. the PANGO nomenclature for SARS-CoV-2) and assignment are key to tracking the evolution of the virus. Both require annotating the clade root of each lineage on an unlabeled ancestral node of the phylogenetic tree, so that the descendant samples under each clade root can be inferred to belong to the corresponding lineage. This is the ancestral lineage annotation problem, for which matUtils (a package in pUShER) and PastML are commonly used. However, their computational tractability is a challenge, and their accuracy on huge SARS-CoV-2 phylogenies needs further exploration. We have developed an efficient and accurate method, called "F1ALA", that uses the F1-score to evaluate the confidence with which a specific ancestral node can be annotated as the clade root of a lineage, given the lineage labels of a set of taxa in a rooted tree. Compared to these methods, F1ALA ran roughly an order of magnitude faster while using ∼12% of their memory when annotating 2277 PANGO lineages in a phylogeny of 5.26 million taxa. F1ALA allows real-time lineage tracking to be performed on a laptop computer. In tests on empirical and simulated data, F1ALA outperformed matUtils (pUShER) with statistical significance and had accuracy comparable to PastML. F1ALA also enables tree refinement by pruning taxa with inconsistent labels to their closest annotation nodes and re-inserting them into the pruned tree, yielding a SARS-CoV-2 phylogeny with both a higher log-likelihood and a lower parsimony score. Given its ultrafast speed and high accuracy, we anticipate that F1ALA will also be useful for large phylogenies of other viruses. Code and benchmark datasets are publicly available at https://github.com/id-bioinfo/F1ALA.
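The idea of scoring an ancestral node by F1 can be illustrated with a minimal Python sketch. This is not the F1ALA implementation, and the abstract does not spell out how the score is computed. The sketch assumes that, for a candidate node and a lineage, precision is the fraction of labeled tips under the node that carry the lineage label, and recall is the fraction of all tips with that label that fall under the node. The function name `annotate_clade_roots`, the dictionary-based tree encoding, and the toy tree are all hypothetical, and the recursive traversal is only suitable for small trees.

```python
from collections import Counter


def annotate_clade_roots(children, labels, root):
    """Return {lineage: (best_node, best_f1)} for a rooted tree.

    children: dict mapping each internal node to a list of its child nodes
    labels:   dict mapping each labeled tip to its lineage label
    Assumed scoring (not taken from the paper): F1 of precision and recall
    computed over the labeled tips under each internal node.
    """
    totals = Counter(labels.values())  # labeled tips per lineage, whole tree
    best = {}

    def visit(node):
        # Post-order traversal: gather lineage counts of the tips below node.
        kids = children.get(node, [])
        if not kids:
            # A tip contributes one count if it has a label, else nothing.
            return Counter({labels[node]: 1}) if node in labels else Counter()
        counts = Counter()
        for kid in kids:
            counts += visit(kid)
        n_tips = sum(counts.values())
        for lineage, hits in counts.items():
            precision = hits / n_tips
            recall = hits / totals[lineage]
            f1 = 2 * precision * recall / (precision + recall)
            # Strict comparison: on ties, the node visited first is kept.
            if f1 > best.get(lineage, (None, 0.0))[1]:
                best[lineage] = (node, f1)
        return counts

    visit(root)
    return best


# Hypothetical toy tree: tip "e" carries a label inconsistent with its clade.
children = {
    "root": ["n1", "n2"],
    "n1": ["a", "b"],
    "n2": ["c", "n3"],
    "n3": ["d", "e"],
}
labels = {"a": "A", "b": "A", "c": "B", "d": "B", "e": "A"}

result = annotate_clade_roots(children, labels, "root")
for lineage, (node, f1) in sorted(result.items()):
    print(lineage, node, round(f1, 3))
```

In this toy case lineage A is placed at `n1` and lineage B at `n2`, leaving tip `e` as a taxon whose label disagrees with its annotated clade, which is the kind of taxon the refinement step described above would target.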