Robust covariance estimation for high-dimensional compositional data with application to microbial communities analysis.

Author information

Zhongtai Securities Institute for Financial Studies, Shandong University, Jinan, Shandong, China.

School of Mathematics and Statistics and Research Institute of Mathematical Sciences, Jiangsu Normal University, Xuzhou, Jiangsu, China.

Publication information

Stat Med. 2021 Jul 10;40(15):3499-3515. doi: 10.1002/sim.8979. Epub 2021 Apr 11.

DOI: 10.1002/sim.8979

PMID: 33840134

Abstract

Microbial community analysis is drawing growing attention owing to the rapid development of high-throughput sequencing techniques. The observed data have several typical characteristics: they are high-dimensional, compositional (lying in a simplex), and may be leptokurtic and highly skewed because of overly abundant taxa. These features make conventional correlation analysis unsuitable for studying co-occurrence and co-exclusion relationships between microbial taxa. In this article, we address the challenges of covariance estimation for this kind of data. Assuming that the basis covariance matrix lies in a well-recognized class of sparse covariance matrices, we adopt a proxy matrix known in the literature as the centered log-ratio covariance matrix. We construct a Median-of-Means (MOM) estimator for the centered log-ratio covariance matrix and propose a thresholding procedure that adapts to the variability of individual entries. Under a finite fourth moment condition, which is much weaker than the sub-Gaussianity condition imposed in the literature, we derive the optimal rate of convergence under the spectral norm. We also provide theoretical guarantees on support recovery. The adaptive thresholding procedure for the MOM estimator is easy to implement and gains robustness when outliers or heavy tails are present. Thorough simulation studies show the advantages of the proposed procedure over some state-of-the-art methods. Finally, we apply the proposed method to a human gut microbiome dataset.
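The pipeline the abstract describes (centered log-ratio transform, Median-of-Means covariance estimate, entry-adaptive thresholding) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the random block split, the use of an entry-wise median of block-wise sample covariances, the plug-in variance estimate `theta`, the constant `delta`, and hard thresholding are all assumptions chosen for illustration.

```python
import numpy as np


def clr(X):
    """Centered log-ratio transform; rows of X are compositions with positive entries."""
    log_x = np.log(X)
    return log_x - log_x.mean(axis=1, keepdims=True)


def mom_covariance(Z, n_blocks, seed=None):
    """Entry-wise median of block-wise sample covariances (an assumed MOM variant).

    Each block must contain at least two rows.
    """
    rng = np.random.default_rng(seed)
    blocks = np.array_split(rng.permutation(Z.shape[0]), n_blocks)
    block_covs = np.stack([np.cov(Z[b], rowvar=False) for b in blocks])
    S = np.median(block_covs, axis=0)
    return (S + S.T) / 2.0  # enforce exact symmetry


def adaptive_threshold(S, Z, delta=2.0):
    """Hard-threshold S with entry-specific levels delta * sqrt(theta_jk * log(p) / n).

    theta_jk is a plain sample variance of the centered products, which is
    an assumption of this sketch and is itself not robust to outliers.
    Builds an (n, p, p) array, so it is only suitable for small p.
    """
    n, p = Z.shape
    Zc = Z - Z.mean(axis=0)
    products = Zc[:, :, None] * Zc[:, None, :]
    theta = products.var(axis=0)
    lam = delta * np.sqrt(theta * np.log(p) / n)
    T = np.where(np.abs(S) >= lam, S, 0.0)
    np.fill_diagonal(T, np.diag(S))  # the diagonal is never thresholded
    return T


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W = rng.lognormal(size=(200, 10))      # simulated positive "basis" abundances
    X = W / W.sum(axis=1, keepdims=True)   # compositions on the simplex
    Z = clr(X)
    S = mom_covariance(Z, n_blocks=10, seed=0)
    T = adaptive_threshold(S, Z)
    print("nonzero off-diagonal entries:", int((T != 0).sum()) - T.shape[0])
```

The median across blocks limits the influence of a few extreme samples on any single entry, which is the intuition behind requiring only finite fourth moments; the number of blocks and `delta` would need to be tuned in practice.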

Similar articles

Robust Covariance Matrix Estimation for High-Dimensional Compositional Data with Application to Sales Data Analysis.

J Bus Econ Stat. 2023;41(4):1090-1100. doi: 10.1080/07350015.2022.2106990. Epub 2022 Sep 21.

Inference for High-dimensional Differential Correlation Matrices.

J Multivar Anal. 2016 Jan 1;143:107-126. doi: 10.1016/j.jmva.2015.08.019.

Large Covariance Estimation by Thresholding Principal Orthogonal Complements.

J R Stat Soc Series B Stat Methodol. 2013 Sep 1;75(4). doi: 10.1111/rssb.12016.

Minimax Rate-optimal Estimation of High-dimensional Covariance Matrices with Incomplete Data.

J Multivar Anal. 2016 Sep;150:55-74. doi: 10.1016/j.jmva.2016.05.002. Epub 2016 May 19.

Generalized Hotelling's test for paired compositional data with application to human microbiome studies.

Genet Epidemiol. 2018 Jul;42(5):459-469. doi: 10.1002/gepi.22127. Epub 2018 May 7.

gCoda: Conditional Dependence Network Inference for Compositional Data.

J Comput Biol. 2017 Jul;24(7):699-708. doi: 10.1089/cmb.2017.0054. Epub 2017 May 10.

Robust estimation of high-dimensional covariance and precision matrices.

Biometrika. 2018 Jun 1;105(2):271-284. doi: 10.1093/biomet/asy011. Epub 2018 Mar 27.

A Shrinkage Principle for Heavy-Tailed Data: High-Dimensional Robust Low-Rank Matrix Recovery.

Ann Stat. 2021 Jun;49(3):1239-1266. doi: 10.1214/20-aos1980. Epub 2021 Aug 9.

A maximum-type microbial differential abundance test with application to high-dimensional microbiome data analyses.

Front Cell Infect Microbiol. 2022 Oct 28;12:988717. doi: 10.3389/fcimb.2022.988717. eCollection 2022.
