Wang Shulei, Cai T Tony, Li Hongzhe
Department of Biostatistics, Epidemiology and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104.
Department of Statistics, The Wharton School, University of Pennsylvania, Philadelphia, PA 19104.
J Am Stat Assoc. 2021;116(535):1237-1253. doi: 10.1080/01621459.2019.1699422. Epub 2020 Jan 23.
The weighted UniFrac distance, a plug-in estimator of the Wasserstein distance of read counts on a tree, has been widely used to measure differences between microbial communities in microbiome studies. Our investigation, however, shows that such a plug-in estimator, although intuitive and commonly used in practice, suffers from potential bias. Motivated by this finding, we study the problem of optimal estimation of the Wasserstein distance between two distributions on a tree from sampled data in the high-dimensional setting. The minimax rate of convergence is established. To overcome the bias problem, we introduce a new estimator, referred to as the moment-screening estimator on a tree (MET), which uses implicit best polynomial approximation that incorporates the tree structure. The new estimator is computationally efficient and is shown to be minimax rate-optimal. Numerical studies using both simulated and real biological datasets demonstrate the practical merits of MET, including reduced bias and statistically more significant differences in the microbiome between patients with inactive Crohn's disease and normal controls.
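As an illustration of the plug-in estimator the abstract refers to, the following is a minimal sketch, not the authors' code and not the MET estimator. It assumes the unnormalized form of weighted UniFrac, in which each edge contributes its length times the absolute difference between the two samples' empirical read proportions in the subtree below that edge. The function name `weighted_unifrac` and the dictionary-based tree encoding (`parent`, `length`) are hypothetical choices made for this example.

```python
from collections import defaultdict


def weighted_unifrac(parent, length, counts_a, counts_b):
    """Plug-in estimate of the Wasserstein distance on a rooted tree.

    parent:   dict mapping each node to its parent (the root maps to None)
    length:   dict mapping each non-root node to the length of the edge
              joining it to its parent
    counts_a, counts_b: dicts mapping nodes to observed read counts
    (All names and the encoding are assumptions for this sketch.)
    """
    total_a = sum(counts_a.values())
    total_b = sum(counts_b.values())

    # Difference in empirical proportions placed directly on each node.
    diff = defaultdict(float)
    for node, count in counts_a.items():
        diff[node] += count / total_a
    for node, count in counts_b.items():
        diff[node] -= count / total_b

    def depth(node):
        d = 0
        while parent[node] is not None:
            node = parent[node]
            d += 1
        return d

    # Visit deepest nodes first, so that by the time a node is processed
    # diff[node] already holds the difference for its whole subtree.
    distance = 0.0
    for node in sorted(parent, key=depth, reverse=True):
        if parent[node] is None:
            continue  # the root has no edge above it
        distance += length[node] * abs(diff[node])
        diff[parent[node]] += diff[node]
    return distance


# Toy example: root -> I -> {A, B}, all edges of length 1.
parent = {"root": None, "I": "root", "A": "I", "B": "I"}
length = {"I": 1.0, "A": 1.0, "B": 1.0}
d = weighted_unifrac(parent, length, {"A": 3, "B": 1}, {"A": 1, "B": 3})
```

In the toy example, half of the mass has to move from A to B along a path of length 2. Because the proportions are estimated from finite read counts, this quantity inherits the bias discussed in the abstract.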