Wang Shulei, Yuan Bo, Cai T Tony, Li Hongzhe
Department of Statistics, University of Illinois at Urbana-Champaign, 725 South Wright Street, Champaign, Illinois 61820, U.S.A.
Department of Statistics and Data Science, The Wharton School, University of Pennsylvania, Philadelphia, Pennsylvania 19104, U.S.A.
Biometrika. 2023 Dec 1;111(3):881-902. doi: 10.1093/biomet/asad075. eCollection 2024 Sep.
Phylogenetic association analysis plays a crucial role in investigating the correlation between microbial compositions and specific outcomes of interest in microbiome studies. However, existing methods for testing such associations have limitations related to the assumption of a linear association in high-dimensional settings and the handling of confounding effects. Hence, there is a need for methods capable of characterizing complex associations, including nonmonotonic relationships. This article introduces a novel phylogenetic association analysis framework and associated tests to address these challenges by employing conditional rank correlation as a measure of association. The proposed tests account for confounders in a fully nonparametric manner, ensuring robustness against outliers and the ability to detect diverse dependencies. The proposed framework aggregates conditional rank correlations for subtrees using weighted sum and maximum approaches to capture both dense and sparse signals. The significance level of the test statistics is determined by calibration through a nearest-neighbour bootstrapping method, which is straightforward to implement and can accommodate additional datasets when these are available. The practical advantages of the proposed framework are demonstrated through numerical experiments using both simulated and real microbiome datasets.
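The aggregation and calibration steps described in the abstract can be illustrated with a small sketch. This is not the authors' implementation; it makes several assumptions. The absolute Spearman correlation stands in for the paper's conditional rank correlation, the confounder is handled only through the resampling step, and the resampling scheme (each outcome replaced by the outcome of a random one of its k nearest neighbours in confounder space) is one plausible reading of "nearest-neighbour bootstrapping". All function names, parameters (`k`, `n_boot`, `weights`) and the layout of `X` (one column of aggregated abundance per subtree) are hypothetical.

```python
import numpy as np

def ranks(v):
    # Average ranks, 1-based; tied values share the mean of their positions.
    _, inv, counts = np.unique(v, return_inverse=True, return_counts=True)
    ends = np.cumsum(counts)
    return (ends - (counts - 1) / 2.0)[inv]

def abs_spearman(x, y):
    # Stand-in association measure (assumption): |Spearman correlation|.
    return abs(np.corrcoef(ranks(x), ranks(y))[0, 1])

def aggregate(stats, weights):
    # Weighted sum is aimed at dense signals, the maximum at sparse ones.
    return float(np.dot(weights, stats)), float(np.max(stats))

def nn_bootstrap_pvalues(X, y, z, weights, k=5, n_boot=200, seed=0):
    """Return p-values for the (weighted-sum, maximum) statistics.

    X: (n, m) per-subtree abundances, y: (n,) outcome, z: (n, d) confounders.
    """
    rng = np.random.default_rng(seed)
    n, m = X.shape
    # Indices of the k nearest samples in confounder space (self included).
    dist = np.linalg.norm(z[:, None, :] - z[None, :, :], axis=2)
    nbrs = np.argsort(dist, axis=1)[:, :k]

    def both(y_vec):
        stats = np.array([abs_spearman(X[:, j], y_vec) for j in range(m)])
        return aggregate(stats, weights)

    obs = np.array(both(y))
    null = np.empty((n_boot, 2))
    for b in range(n_boot):
        # Draw each outcome from a random neighbour with similar confounders,
        # which breaks the link to X while roughly preserving the link to z.
        pick = nbrs[np.arange(n), rng.integers(0, k, size=n)]
        null[b] = both(y[pick])
    # Add-one Monte Carlo p-values, so the result is never exactly zero.
    return (1 + (null >= obs).sum(axis=0)) / (1 + n_boot)
```

Calling `nn_bootstrap_pvalues(X, y, z, weights=np.ones(X.shape[1]))` returns two p-values, one per aggregation rule. Reporting both reflects the abstract's point that the sum and the maximum are sensitive to different signal patterns.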