Department of Pediatrics, University of California San Diego School of Medicine, La Jolla, California, USA.
Bioinformatics and Systems Biology Program, University of California San Diego, La Jolla, California, USA.
mSystems. 2022 Jun 28;7(3):e0005022. doi: 10.1128/msystems.00050-22. Epub 2022 Apr 28.
Microbiome data have several specific characteristics (sparsity and compositionality) that introduce challenges in data analysis. Integrating prior information about the data structure, such as phylogenetic structure and repeated-measure study designs, into the analysis is an effective approach for revealing robust patterns in microbiome data. Past methods have addressed some but not all of these challenges and features: for example, robust principal-component analysis (RPCA) addresses sparsity and compositionality; compositional tensor factorization (CTF) addresses sparsity, compositionality, and repeated-measure study designs; and UniFrac incorporates phylogenetic information. Here we introduce a strategy for incorporating phylogenetic information into RPCA and CTF. The resulting methods, phylo-RPCA and phylo-CTF, provide substantial improvements over state-of-the-art methods in discriminatory power for underlying sample groupings ranging from mode of delivery to adult human lifestyle. We demonstrate quantitatively that the addition of phylogenetic information improves effect size and classification accuracy in both data-driven simulated data and real microbiome data.

Microbiome data analysis can be difficult because of particular data features, some unavoidable and some due to technical limitations of DNA sequencing instruments. The first step in many analyses that ultimately reveal patterns of similarities and differences among sets of samples (e.g., separating samples from sick and healthy people or samples from seawater versus soil) is calculating the difference between each pair of samples. We introduce two new methods to calculate these differences that combine features of past methods, specifically by taking into account the principles that most types of microbes are absent from most samples (sparsity), that abundances are relative rather than absolute (compositionality), and that all microbes have a shared evolutionary history (phylogeny). We show using simulated and real data that our new methods provide improved classification accuracy of ordinal sample clusters and increased effect size between sample groups on beta-diversity distances.
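To make the three principles concrete, the following is a minimal Python sketch on a toy count table, not the authors' implementation. It assumes a robust centered log-ratio transform (log of the nonzero counts, centered per sample, zeros treated as missing) as the way sparsity and compositionality are handled, and assumes phylogeny enters by appending one summed-count column per internal tree node. The tree `((A,B),(C,D))`, the counts, and the helper names `rclr`, `add_clade_counts`, and `shared_feature_distance` are all hypothetical, and the distance over jointly observed features is a simple stand-in for the matrix-completion step that RPCA relies on.

```python
import numpy as np


def rclr(counts):
    """Log-transform nonzero counts and center each sample (row) on the
    mean of its observed log counts. Zeros stay NaN, i.e. missing."""
    counts = np.asarray(counts, dtype=float)
    logged = np.full(counts.shape, np.nan)
    observed = counts > 0
    logged[observed] = np.log(counts[observed])
    return logged - np.nanmean(logged, axis=1, keepdims=True)


def add_clade_counts(counts, clades):
    """Append one column per internal node holding the summed counts of
    the tips beneath it (assumed way of encoding the phylogeny)."""
    counts = np.asarray(counts, dtype=float)
    clade_columns = [counts[:, tips].sum(axis=1) for tips in clades]
    return np.column_stack([counts] + clade_columns)


def shared_feature_distance(x, y):
    """Root-mean-square difference over features observed in both
    samples; NaN when the two samples share no observed feature."""
    shared = ~np.isnan(x) & ~np.isnan(y)
    if not shared.any():
        return float("nan")
    return float(np.sqrt(np.mean((x[shared] - y[shared]) ** 2)))


# Toy table: 3 samples (rows) by 4 microbes A, B, C, D (columns).
counts = np.array([
    [10, 0, 5, 0],
    [0, 20, 0, 4],
    [8, 0, 0, 2],
])

# Hypothetical tree ((A,B),(C,D)): each internal node lists its tip columns.
clades = [[0, 1], [2, 3], [0, 1, 2, 3]]

tips_only = rclr(counts)
with_clades = rclr(add_clade_counts(counts, clades))

# Samples 0 and 1 share no microbe, so the tips-only distance is undefined,
# while the internal-node columns give them features in common.
print(shared_feature_distance(tips_only[0], tips_only[1]))
print(shared_feature_distance(with_clades[0], with_clades[1]))
```

In this toy case the first two samples have no microbe in common, so a tips-only comparison has nothing to work with, whereas the clade totals are observed in both samples. This is only an illustration of why shared evolutionary history can help with sparsity; it says nothing about how large the effect is on real data.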