Zhai Jing, Kim Juhyun, Knox Kenneth S, Twigg Homer L, Zhou Hua, Zhou Jin J
Department of Epidemiology and Biostatistics, University of Arizona, Tucson, AZ, United States.
Department of Biostatistics, University of California, Los Angeles, Los Angeles, CA, United States.
Front Microbiol. 2018 Mar 28;9:509. doi: 10.3389/fmicb.2018.00509. eCollection 2018.
High-throughput sequencing technology has enabled population-based studies of the role of the human microbiome in disease etiology and exposure response. Microbiome data are summarized as counts or compositions of bacterial taxa at different taxonomic levels. An important problem is to identify the bacterial taxa that are associated with a response. One method is to test the association of a specific taxon with phenotypes in a linear mixed effect model, which incorporates phylogenetic information among bacterial communities. Another type of approach considers all taxa in a joint model and achieves selection via penalization, but ignores phylogenetic information. In this paper, we consider regression analysis by treating bacterial taxa at different levels as multiple random effects. For each taxon, a kernel matrix is calculated based on distance measures in the phylogenetic tree and acts as one variance component in the joint model. Taxonomic selection is then achieved by applying the lasso (least absolute shrinkage and selection operator) penalty to the variance components. Our method integrates biological information into the variable selection problem and greatly improves selection accuracy. Simulation studies demonstrate that our method outperforms existing methods such as the group lasso. Finally, we apply our method to a longitudinal microbiome study of Human Immunodeficiency Virus (HIV) infected patients. We implement our method in the high-performance computing language Julia. Software and detailed documentation are freely available at https://github.com/JingZhai63/VCselection.
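The model described in the abstract can be written out as a short sketch. The notation below is assumed for illustration and is not taken from the paper: $y$ is the response vector, $X$ the fixed-effect covariates, $K_l$ the phylogeny-based kernel matrix of taxon $l$, $\sigma_l$ its variance-component parameter, and $\lambda$ the tuning parameter. Placing the penalty on $\sigma_l$ rather than on $\sigma_l^2$ is also an assumption of this sketch.

```latex
% Joint model: each taxon l contributes one random effect h_l
y = X\beta + \sum_{l=1}^{L} h_l + \varepsilon, \qquad
h_l \sim N\!\left(0,\ \sigma_l^2 K_l\right), \qquad
\varepsilon \sim N\!\left(0,\ \sigma_0^2 I_n\right)

% Implied marginal covariance of the response
\operatorname{Var}(y) = V(\sigma) = \sum_{l=1}^{L} \sigma_l^2 K_l + \sigma_0^2 I_n

% Lasso-penalized negative log-likelihood over the variance components
\min_{\beta,\ \sigma \ge 0}\;
\frac{1}{2}\log\det V(\sigma)
+ \frac{1}{2}\,(y - X\beta)^{\top} V(\sigma)^{-1} (y - X\beta)
+ \lambda \sum_{l=1}^{L} \sigma_l
```

Under this formulation, a taxon is dropped from the model when its estimate $\hat\sigma_l$ is shrunk exactly to zero, so increasing $\lambda$ selects fewer taxa. The residual term $\sigma_0$ is left unpenalized in this sketch.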