Liu Ruitao, Qiao Xi, Shi Yushu, Peterson Christine B, Bush William S, Cominelli Fabio, Wang Ming, Zhang Liangliang
Department of Population and Quantitative Health Sciences, School of Medicine, Case Western Reserve University, 10900 Euclid Avenue, Cleveland, 44106, OH, United States.
Weill Cornell Medicine, Cornell University, 1300 York Ave, New York, 10065, NY, United States.
Comput Struct Biotechnol J. 2024 Oct 24;23:3859-3868. doi: 10.1016/j.csbj.2024.10.032. eCollection 2024 Dec.
As next-generation sequencing technologies advance rapidly and the cost of metagenomic sequencing continues to decrease, researchers now face an unprecedented volume of microbiome data. This surge has stimulated the development of scalable microbiome data analysis methods and has made it necessary to incorporate phylogenetic information into microbiome analysis for improved accuracy. Tools for constructing phylogenetic trees from 16S rRNA sequencing data are well established, because the 16S gene has a limited number of highly conserved regions, which simplifies the identification of marker genes. In contrast, metagenomic and whole-genome shotgun (WGS) sequencing read random fragments of the entire genome, so identifying consistent marker genes is challenging owing to the vast diversity of genomic regions, and robust tools for constructing phylogenetic trees from such data are scarce. Although tools for constructing trees from bacterial sequences exist for upstream bioinformatics, many downstream researchers (those who integrate these trees into statistical models or machine learning) are either unaware of these tools or find them difficult to use because of the steep learning curve of processing raw sequences. The problem is compounded by the fact that public datasets often lack phylogenetic trees and provide only abundance tables and taxonomic classifications. To address this, we present a comprehensive review of phylogenetic tree construction techniques for microbiome data (16S rRNA or whole-genome shotgun sequencing). We outline the strengths and limitations of current methods and offer expert insights and step-by-step guidance to make these tools more accessible and widely applicable in quantitative microbiome data analysis.