Pookhao Naruekamol, Sohn Michael B, Li Qike, Jenkins Isaac, Du Ruofei, Jiang Hongmei, An Lingling
Department of Agricultural & Biosystems Engineering, Interdisciplinary Program in Statistics, University of Arizona, Tucson, AZ, 85721 and Department of Statistics, Northwestern University, Evanston, IL 60208, USA.
Bioinformatics. 2015 Jan 15;31(2):158-65. doi: 10.1093/bioinformatics/btu635. Epub 2014 Sep 24.
As new sequencing technologies produce massive amounts of short-read data, metagenomics is growing rapidly, especially in environmental biology and medical science. Metagenomic data are high dimensional, with a large number of features and a limited number of samples, and they are also complex, with many zeros and skewed distributions. Efficient computational and statistical tools are needed to handle these characteristics of metagenomic sequencing data. One main objective of metagenomic studies is to assess whether and how multiple microbial communities differ under various environmental conditions.
We propose a two-stage statistical procedure for selecting informative features and identifying differentially abundant features between two or more groups of microbial communities. In the functional analysis of metagenomes, the features may be pathways, subsystems, functional roles and so on. In the first stage, informative features are selected using the elastic net, which reduces the dimension of the metagenomic data. In the second stage, differentially abundant features are detected among the selected features using generalized linear models with a negative binomial distribution. Compared with other available methods, the proposed approach performs better in most of the comprehensive simulation studies. The new method is also applied to two real metagenomic datasets related to human health, and the findings are consistent with those in previous reports.
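The two stages described above can be illustrated with a minimal Python sketch. It is not the authors' R implementation, and several choices are assumptions made only for illustration: exactly two groups, a squared-error elastic net fitted by coordinate descent on standardized log(1 + count) values with the centered group label as response, a crude moment-based dispersion shared by both groups, no library-size offset, and a likelihood-ratio test in place of whatever test the authors use. All function names (`elastic_net`, `nb_lrt_pvalue`, `two_stage`) are hypothetical.

```python
import math

def soft_threshold(z, g):
    """Soft-thresholding operator arising from the L1 part of the penalty."""
    return math.copysign(max(abs(z) - g, 0.0), z)

def standardize(col):
    """Center a column and scale it to unit variance (1/n convention)."""
    m = sum(col) / len(col)
    s = math.sqrt(sum((v - m) ** 2 for v in col) / len(col)) or 1.0
    return [(v - m) / s for v in col]

def elastic_net(X, y, lam=0.1, alpha=0.5, n_iter=200):
    """Coordinate descent for
    (1/2n)||y - Xb||^2 + lam*(alpha*|b|_1 + (1 - alpha)/2*|b|_2^2).
    Assumes standardized columns of X and a centered y."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = list(y)  # y - X beta, with beta = 0 at the start
    for _ in range(n_iter):
        for j in range(p):
            # Covariance of feature j with the partial residual
            rho = sum(X[i][j] * (resid[i] + X[i][j] * beta[j]) for i in range(n)) / n
            new = soft_threshold(rho, lam * alpha) / (1.0 + lam * (1.0 - alpha))
            delta = new - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                beta[j] = new
    return beta

def nb_loglik(ys, mu, k):
    """Negative binomial log-likelihood, mean mu, size k (variance mu + mu^2/k)."""
    mu = max(mu, 1e-8)  # guard against log(0) for all-zero groups
    return sum(math.lgamma(y + k) - math.lgamma(k) - math.lgamma(y + 1)
               + k * math.log(k / (k + mu)) + y * math.log(mu / (k + mu))
               for y in ys)

def nb_lrt_pvalue(y0, y1):
    """Likelihood-ratio test for a group effect in a two-group NB log-linear
    model. With a fixed size k, the fitted group means equal the sample means."""
    m0, m1 = sum(y0) / len(y0), sum(y1) / len(y1)
    m = sum(y0 + y1) / (len(y0) + len(y1))
    ss = sum((v - m0) ** 2 for v in y0) + sum((v - m1) ** 2 for v in y1)
    var = ss / (len(y0) + len(y1) - 2)       # pooled within-group variance
    mbar = (m0 + m1) / 2.0
    k = mbar ** 2 / (var - mbar) if var > mbar else 1e6  # near-Poisson fallback
    ll_alt = nb_loglik(y0, m0, k) + nb_loglik(y1, m1, k)
    ll_null = nb_loglik(y0 + y1, m, k)
    lr = max(2.0 * (ll_alt - ll_null), 0.0)
    return math.erfc(math.sqrt(lr / 2.0))    # chi-square(1) upper tail

def two_stage(counts, groups, lam=0.1, alpha=0.5):
    """counts: samples x features count matrix; groups: 0/1 label per sample.
    Returns {feature index: p-value} for features kept by stage 1."""
    n, p = len(counts), len(counts[0])
    # Stage 1: elastic net on standardized log counts
    cols = [standardize([math.log1p(counts[i][j]) for i in range(n)]) for j in range(p)]
    X = [[cols[j][i] for j in range(p)] for i in range(n)]
    gbar = sum(groups) / n
    beta = elastic_net(X, [g - gbar for g in groups], lam, alpha)
    selected = [j for j in range(p) if beta[j] != 0.0]
    # Stage 2: NB test on the raw counts of each selected feature
    results = {}
    for j in selected:
        y0 = [counts[i][j] for i in range(n) if groups[i] == 0]
        y1 = [counts[i][j] for i in range(n) if groups[i] == 1]
        results[j] = nb_lrt_pvalue(y0, y1)
    return results
```

Calling `two_stage(counts, groups)` on a small count matrix returns p-values only for the features that survive the elastic-net screen, so the second stage tests far fewer hypotheses than a feature-by-feature analysis of the full matrix would. In practice the p-values would still need a multiple-testing adjustment, and the dispersion would be estimated by maximum likelihood rather than by moments.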
R code and two example datasets are available at http://cals.arizona.edu/~anling/software.htm.
A supplementary file is available at Bioinformatics online.