Department of Economics, Management and Statistics, University of Milano-Bicocca, 20126 Milano, Italy.
Institute of Statistics & Mathematical Methods in Economics, Vienna University of Technology, 1040 Vienna, Austria.
Bioinformatics. 2021 Nov 5;37(21):3805-3814. doi: 10.1093/bioinformatics/btab572.
High-throughput sequencing technologies generate huge amounts of data, permitting the quantification of microbiome compositions. The resulting data are essentially sparse compositional vectors, that is, vectors of the bacterial gene proportions that make up the microbiome. Consequently, the need for statistical and computational methods that account for the special nature of microbiome data has increased. A critical task in microbiome research is to identify microbes associated with a clinical outcome. Another crucial task with high-dimensional data is the detection of outlying observations, whose presence seriously affects prediction accuracy.
In this article, we connect robustness and sparsity in the context of variable selection in regression with compositional covariates and a continuous response. The compositional character of the covariates is taken into account by a linear log-contrast model, and elastic-net regularization achieves sparsity in the regression coefficient estimates. Robustness is obtained by trimming the objective function of the estimator. A reweighting step increases the efficiency of the estimator and also allows for diagnostics in terms of outlier identification. The numerical performance of the proposed method is evaluated via simulation studies, and its usefulness is illustrated by an application to a microbiome study that aims to predict caffeine intake from the composition of the human gut microbiome.
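To make these ingredients concrete, the following is a minimal Python sketch of a trimmed elastic-net objective for a log-contrast model. It is an illustration under stated assumptions, not the RobZS implementation: the function name `trimmed_enet_objective`, the penalty scaling by `h`, and the toy data are all hypothetical choices made here. It only evaluates the objective for given coefficients; it does not perform the optimization or the reweighting step.

```python
import numpy as np


def trimmed_enet_objective(Z, y, beta0, beta, h, lam, alpha):
    """Evaluate a trimmed elastic-net objective for a log-contrast model.

    Z     : (n, p) matrix of log-transformed compositional covariates
    y     : (n,) continuous response
    beta  : (p,) coefficients, required to sum to zero (log-contrast constraint)
    h     : number of observations kept after trimming (h <= n)
    lam   : overall penalty strength (assumed parametrization)
    alpha : mixing weight between the L1 and squared L2 penalties
    """
    beta = np.asarray(beta, dtype=float)
    if not np.isclose(beta.sum(), 0.0):
        raise ValueError("log-contrast coefficients must sum to zero")
    squared_residuals = (y - beta0 - Z @ beta) ** 2
    # Trimming: only the h smallest squared residuals enter the loss,
    # so the n - h worst-fitting observations cannot dominate it.
    trimmed_loss = np.sort(squared_residuals)[:h].sum()
    # Elastic-net penalty: the L1 part encourages sparsity, the L2 part stabilizes.
    penalty = alpha * np.abs(beta).sum() + 0.5 * (1.0 - alpha) * (beta ** 2).sum()
    return trimmed_loss + h * lam * penalty


# Toy data (hypothetical): 20 compositions with 4 parts, noiseless response.
rng = np.random.default_rng(0)
X = rng.uniform(0.1, 1.0, size=(20, 4))
X = X / X.sum(axis=1, keepdims=True)      # rows are proportions summing to 1
Z = np.log(X)
beta_true = np.array([1.0, -1.0, 0.5, -0.5])  # sums to zero
y = 2.0 + Z @ beta_true

# Contaminate one response value with a gross outlier.
y_contaminated = y.copy()
y_contaminated[0] += 100.0

untrimmed = trimmed_enet_objective(Z, y_contaminated, 2.0, beta_true, h=20, lam=0.0, alpha=0.5)
trimmed = trimmed_enet_objective(Z, y_contaminated, 2.0, beta_true, h=15, lam=0.0, alpha=0.5)
print("loss with all observations:", untrimmed)
print("loss after trimming:", trimmed)

# The zero-sum constraint makes the linear predictor invariant to rescaling
# the compositions, e.g. using raw counts instead of proportions.
print(np.allclose(np.log(7.0 * X) @ beta_true, Z @ beta_true))
```

With the penalty switched off (`lam=0.0`), the single contaminated observation dominates the untrimmed loss, whereas the trimmed loss ignores it. In a full procedure, observations flagged in this way would typically be down-weighted or excluded in a reweighted refit, which is where the outlier diagnostics mentioned above come from.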
The R-package 'RobZS' can be downloaded at https://github.com/giannamonti/RobZS.
Supplementary data are available at Bioinformatics online.