Pearson Talima, Okinaka Richard T, Foster Jeffrey T, Keim Paul
Center for Microbial Genetics and Genomics, Northern Arizona University, Flagstaff, AZ, USA.
Infect Genet Evol. 2009 Sep;9(5):1010-9. doi: 10.1016/j.meegid.2009.05.014. Epub 2009 May 27.
Phylogenetic hypotheses based on whole genome sequences have the potential for unprecedented accuracy, yet a failure to understand discovery bias, character sampling, and strain sampling can lead to highly erroneous conclusions. For microbial pathogens, phylogenies derived from whole genome sequences are becoming more common, because large numbers of characters distributed across entire genomes can yield extremely accurate phylogenies, particularly for strictly clonal populations. The availability of whole genomes is increasing as new sequencing technologies reduce the cost and time required for genome sequencing. Until entire sample collections can be fully sequenced, extending the phylogenetic power of whole genome sequences beyond a small subset of fully sequenced strains requires the integration of whole genome and partial genome genotyping data. Such integration involves discovering evolutionarily stable polymorphic characters by whole genome comparisons, then determining allelic states across a wide panel of isolates using high-throughput genotyping technologies. Here, we demonstrate how such an approach using single nucleotide polymorphisms (SNPs) yields highly accurate but biased phylogenetic reconstructions, and how the accuracy of the resulting tree is compromised by incomplete taxon and character sampling. Despite recent phylogenetic work detailing the strengths and biases of integrating whole genome and partial genome genotype data, these issues are relatively new and remain poorly understood by many researchers. We revisit these biases and provide strategies for maximizing phylogenetic accuracy. Although we write this review with bacterial pathogens in mind, these concepts apply to any clonally reproducing population, or indeed to any evolutionarily stable marker that is inherited in a strictly clonal manner. Current and emerging technologies can be used to maximize phylogenetic knowledge only with a complete understanding of the strengths and weaknesses of these methods.
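The discovery bias described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' method. The strain names, SNP identifiers, and helper functions (`discover_snps`, `genotype`, `distance`) are all hypothetical, and the four toy genomes are invented data that assume strict clonality with no homoplasy. SNPs are discovered by comparing two "sequenced" genomes, then every strain is genotyped at only those loci.

```python
# Toy illustration of SNP discovery bias in a strictly clonal population.
# All strains and SNP identifiers below are hypothetical, invented data.

# Each strain is represented by the set of SNP loci at which it carries
# the derived allele (the mutations on its root-to-tip path).
genomes = {
    "A": {1, 2, 3, 4},
    "B": {1, 2, 5, 6, 7},
    "C": {1, 2, 5, 8, 9, 10},
    "D": {11, 12},
}


def discover_snps(genomes, sequenced):
    """Return loci that are polymorphic among the sequenced strains only."""
    sets = [genomes[name] for name in sequenced]
    return set.union(*sets) - set.intersection(*sets)


def genotype(genomes, loci):
    """Score every strain at the discovered loci (partial genome genotyping)."""
    return {name: alleles & loci for name, alleles in genomes.items()}


def distance(profiles, a, b):
    """Number of loci at which two strains differ."""
    return len(profiles[a] ^ profiles[b])


# Assumption: only strains A and B have whole genome sequences.
discovered = discover_snps(genomes, ["A", "B"])
profiles = genotype(genomes, discovered)

# SNPs private to C that survive discovery (none are expected to).
others = set().union(*(p for name, p in profiles.items() if name != "C"))
private_c_observed = profiles["C"] - others

print("discovered loci:", sorted(discovered))
print("true C-D distance:", distance(genomes, "C", "D"))
print("observed C-D distance:", distance(profiles, "C", "D"))
print("observed SNPs private to C:", sorted(private_c_observed))
```

In this toy data set the distance between the two sequenced strains is recovered exactly, because every discovered SNP lies on the path connecting them. The unsequenced strains C and D collapse onto that path: their private branches have no discovered SNPs, so the distance between them is badly underestimated.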