Zhang Qunyuan, Abel Haley, Wells Alan, Lenzini Petra, Gomez Felicia, Province Michael A, Templeton Alan A, Weinstock George M, Salzman Nita H, Borecki Ingrid B
Division of Statistical Genomics, Washington University School of Medicine, St. Louis, MO, USA, Department of Biology, Washington University, St. Louis, MO, USA, The Jackson Laboratory for Genomic Medicine, Farmington, CT, USA and Department of Pediatrics, Medical College of Wisconsin, Milwaukee, WI, USA.
Bioinformatics. 2015 May 15;31(10):1607-13. doi: 10.1093/bioinformatics/btu855. Epub 2015 Jan 6.
Establishing a statistical association between microbiome features and clinical outcomes is of growing interest because it may yield insights into biological mechanisms and pathogenesis. Extracting disease-relevant microbiome features is challenging, and existing variable selection methods are limited by the large number of risk factor variables in microbiome sequence data and by their complex biological structure.
We propose a tree-based scanning method, Selection of Models for the Analysis of Risk factor Trees (referred to as SMART-scan), for identifying taxonomic groups that are associated with a disease or trait. SMART-scan is a model selection technique that uses a predefined taxonomy to organize the large pool of possible predictors into optimized groups, and searches the hierarchy to determine which variable groups to test for association. We investigate the statistical properties of SMART-scan through simulations, comparing it with a regular single-variable analysis and three commonly used variable selection methods: stepwise regression, least absolute shrinkage and selection operator (LASSO), and classification and regression tree (CART). When there are taxonomic group effects in the data, SMART-scan can significantly increase power by using bacterial taxonomic information to split large numbers of variables into groups. Through an application to microbiome data from a vervet monkey diet experiment, we demonstrate that SMART-scan can identify important phenotype-associated taxonomic features missed by single-variable analysis, stepwise regression, LASSO and CART.
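The grouping idea can be illustrated with a minimal sketch. This is not the published SMART-scan procedure, whose search and model selection steps are not specified in this abstract; it only shows, under simplifying assumptions, how a predefined taxonomy can pool variables so that a group-level signal becomes visible. The names `pearson`, `scan_tree`, `lineages`, `abundance` and `phenotype`, the toy values, and the use of a plain Pearson correlation as the association score are all hypothetical choices made for illustration.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)

def scan_tree(lineages, abundance, phenotype):
    """Score every taxonomy node by correlating its summed abundance with the phenotype.

    lineages:  {leaf: (rank_1, rank_2, ..., leaf)}  path from the top rank to the leaf
    abundance: {leaf: [value per sample]}
    Returns {node_path: correlation}, where node_path is a prefix of a lineage.
    """
    n = len(phenotype)
    node_totals = {}
    for leaf, path in lineages.items():
        # Add this leaf's abundance to every ancestor node and to the leaf itself.
        for depth in range(1, len(path) + 1):
            totals = node_totals.setdefault(path[:depth], [0.0] * n)
            for i, value in enumerate(abundance[leaf]):
                totals[i] += value
    return {node: pearson(totals, phenotype) for node, totals in node_totals.items()}

# Toy data (invented): two leaves share a group effect that neither shows fully alone.
lineages = {
    "otu_a": ("Firmicutes", "Clostridia", "otu_a"),
    "otu_b": ("Firmicutes", "Clostridia", "otu_b"),
    "otu_c": ("Bacteroidetes", "Bacteroidia", "otu_c"),
}
abundance = {
    "otu_a": [1, 0, 3, 0, 5, 0],
    "otu_b": [0, 2, 0, 4, 0, 6],
    "otu_c": [3, 1, 2, 3, 1, 2],
}
phenotype = [1, 2, 3, 4, 5, 6]

scores = scan_tree(lineages, abundance, phenotype)
for node, r in sorted(scores.items(), key=lambda kv: -abs(kv[1])):
    print("/".join(node), round(r, 3))
```

In this toy example the pooled Clostridia node tracks the phenotype more closely than either of its member leaves, which is the kind of taxonomic group effect the abstract describes. A real analysis would need a proper test statistic, a rule for choosing among nested groups, and multiple-testing control, none of which are attempted here.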