Department of Computer Science & Engineering, University of California San Diego, La Jolla, CA 92093, USA.
Viruses. 2022 Apr 8;14(4):774. doi: 10.3390/v14040774.
The use of viral sequence data to inform public health intervention has become increasingly common in epidemiology. Such methods typically rely on multiple sequence alignments and phylogenies estimated from the sequence data. Like all estimation techniques, they are error-prone, yet the impacts of such imperfections on downstream epidemiological inferences are poorly understood. To address this, we executed multiple commonly used viral phylogenetic analysis workflows on simulated viral sequence data, modeling Human Immunodeficiency Virus (HIV), Hepatitis C Virus (HCV), and Ebolavirus, and we computed multiple measures of accuracy, motivated by transmission-clustering techniques. For multiple sequence alignment, MAFFT consistently outperformed MUSCLE and Clustal Omega in both accuracy and runtime. For phylogenetic inference, FastTree 2, IQ-TREE, RAxML-NG, and PhyML had similar topological accuracies, but branch lengths and pairwise distances were consistently most accurate in phylogenies inferred by RAxML-NG. However, FastTree 2 was the fastest by orders of magnitude, and when the other tools were used to optimize branch lengths along a fixed FastTree 2 topology, the resulting phylogenies had accuracies that were indistinguishable from those of their original counterparts, at a fraction of the runtime.
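As an illustration of the hybrid workflow described above (MAFFT alignment, FastTree 2 topology, then branch-length optimization by another tool on that fixed topology), the following is a minimal sketch and not the pipeline used in the study. The file names, the `build_workflow` and `run_workflow` helpers, the choice of RAxML-NG for the final step, and the `GTR+G` model are all assumptions made for the example; the three tools are assumed to be installed and on the `PATH`.

```python
import subprocess
from pathlib import Path


def build_workflow(sequences, prefix, model="GTR+G"):
    """Return the three steps as (argv, stdout_path) pairs.

    stdout_path is None when the tool writes its own output files.
    All file names are hypothetical and derived from `prefix`.
    """
    aln = f"{prefix}.aln.fasta"
    topology = f"{prefix}.fasttree.nwk"
    return [
        # 1. Multiple sequence alignment with MAFFT (alignment goes to stdout).
        (["mafft", "--auto", sequences], aln),
        # 2. Fast topology estimate with FastTree 2 under GTR (tree goes to stdout).
        (["FastTree", "-nt", "-gtr", aln], topology),
        # 3. Re-optimize branch lengths on the fixed FastTree 2 topology
        #    with RAxML-NG (assumed model: GTR+G).
        (["raxml-ng", "--evaluate", "--msa", aln, "--tree", topology,
          "--model", model, "--prefix", prefix], None),
    ]


def run_workflow(steps):
    """Run each step in order, redirecting stdout to a file where needed."""
    for argv, stdout_path in steps:
        if stdout_path is None:
            subprocess.run(argv, check=True)
        else:
            with Path(stdout_path).open("w") as out:
                subprocess.run(argv, stdout=out, check=True)
```

Calling `run_workflow(build_workflow("seqs.fasta", "hiv"))` would run the three steps in sequence. Keeping command construction separate from execution makes it possible to inspect or log the exact commands before anything is run.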