Parasitic Diseases Branch, Division of Parasitic Diseases and Malaria, Centers for Disease Control and Prevention, Atlanta, GA, USA; Oak Ridge Associated Universities, Oak Ridge, TN, USA.
Parasitic Diseases Branch, Division of Parasitic Diseases and Malaria, Centers for Disease Control and Prevention, Atlanta, GA, USA; Eagle Global Scientific, San Antonio, TX, USA.
Mol Phylogenet Evol. 2022 Dec;177:107608. doi: 10.1016/j.ympev.2022.107608. Epub 2022 Aug 11.
Multi-locus sequence typing (MLST) is widely used to investigate genetic relationships among eukaryotic taxa, including parasitic pathogens. MLST analysis workflows typically involve construction of alignment-based phylogenetic trees, i.e., trees whose structure is computed from nucleotide differences observed in a multiple sequence alignment (MSA). Notably, alignment-based phylogenetic methods require that each isolate/taxon is represented by a single sequence. When multiple loci are sequenced, these sequences may be concatenated to produce one tree that includes information from all loci. Alignment-based phylogenetic techniques are robust and widely used yet possess some shortcomings, including the way heterozygous sites are handled, intolerance for missing data (i.e., partial genotypes), and differences in the way insertions-deletions (indels) are scored/treated during tree construction. In certain contexts, 'haplotype-based' methods may represent a viable alternative to alignment-based techniques, as they do not possess the aforementioned limitations. This is mainly because haplotype-based methods assess genetic similarity based on numbers of shared (i.e., intersecting) haplotypes as opposed to similarities in nucleotide composition observed in an MSA. For haplotype-based comparisons, choosing an appropriate distance statistic is fundamental, and several statistics are available. However, a comprehensive assessment of the available statistics for their ability to produce a robust haplotype-based phylogenetic reconstruction has not yet been performed. We evaluated seven distance statistics by applying them to extant MLST datasets from the gastrointestinal parasite Cyclospora cayetanensis and two species of pathogenic nematode of the genus Strongyloides. We compared the genetic relationships identified using each statistic to epidemiologic, geographic, and host metadata.
We show that Barratt's heuristic definition of genetic distance was the most robust among the statistics evaluated. Consequently, we propose that Barratt's heuristic is a useful approach for challenging MLST datasets possessing features (i.e., high heterozygosity, partial genotypes, and indel or repeat-based polymorphisms) that confound or preclude the use of alignment-based methods.
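To make the idea of scoring similarity by shared haplotypes concrete, the following is a minimal sketch of a haplotype-based distance. It is not Barratt's heuristic, and it is not claimed to be one of the seven statistics evaluated; it uses a plain Jaccard distance on sets of haplotype labels, averaged over the loci typed in both isolates, which is an assumption made purely for illustration. The isolate names, locus names, and haplotype labels are hypothetical.

```python
from itertools import combinations

# Hypothetical genotypes: isolate -> locus -> set of haplotype labels.
# A heterozygous locus holds more than one haplotype; a locus that failed
# to sequence (partial genotype) is simply absent from the inner dict.
genotypes = {
    "isolate_A": {"locus1": {"h1", "h2"}, "locus2": {"h5"}},
    "isolate_B": {"locus1": {"h1"}, "locus2": {"h5"}, "locus3": {"h9"}},
    "isolate_C": {"locus1": {"h3"}, "locus3": {"h8"}},
}


def jaccard_distance(a, b):
    """Mean per-locus Jaccard distance over loci typed in both isolates.

    Returns None when the two isolates have no typed locus in common,
    because no comparison is possible in that case.
    """
    # Only loci with at least one haplotype in both isolates are compared.
    shared_loci = [locus for locus in a if a[locus] and b.get(locus)]
    if not shared_loci:
        return None
    per_locus = []
    for locus in shared_loci:
        intersection = a[locus] & b[locus]  # haplotypes the isolates share
        union = a[locus] | b[locus]         # all haplotypes seen in either
        per_locus.append(1.0 - len(intersection) / len(union))
    return sum(per_locus) / len(per_locus)


def distance_matrix(genos):
    """Symmetric pairwise distances keyed by (isolate, isolate) tuples."""
    names = sorted(genos)
    dist = {(name, name): 0.0 for name in names}
    for x, y in combinations(names, 2):
        dist[(x, y)] = dist[(y, x)] = jaccard_distance(genos[x], genos[y])
    return dist


if __name__ == "__main__":
    for pair, d in sorted(distance_matrix(genotypes).items()):
        print(pair, d)
```

No alignment is needed here: isolates are compared only on which haplotypes they carry, so heterozygous loci and partial genotypes do not have to be collapsed or discarded. A matrix of this kind could then be passed to a clustering or tree-building step.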