Franzosa Eric A, Huang Katherine, Meadow James F, Gevers Dirk, Lemon Katherine P, Bohannan Brendan J M, Huttenhower Curtis
Biostatistics Department, Harvard School of Public Health, Boston, MA 02115; Microbial Systems and Communities, Genome Sequencing and Analysis Program, The Broad Institute, Cambridge, MA 02142.
Proc Natl Acad Sci U S A. 2015 Jun 2;112(22):E2930-8. doi: 10.1073/pnas.1423854112. Epub 2015 May 11.
Community composition within the human microbiome varies across individuals, but it remains unknown whether this variation is sufficient to uniquely identify individuals within large populations, or stable enough to identify them over time. We investigated this by developing a hitting set-based coding algorithm and applying it to the Human Microbiome Project population. Our approach defined body site-specific metagenomic codes: sets of microbial taxa or genes prioritized to uniquely and stably identify individuals. Codes capturing strain variation in clade-specific marker genes were able to distinguish among hundreds of individuals at an initial sampling time point. In comparisons with follow-up samples collected 30 to 300 days later, approximately 30% of individuals could still be uniquely pinpointed using metagenomic codes from a typical body site; coincidental (false positive) matches were rare. Codes based on the gut microbiome were exceptionally stable and pinpointed more than 80% of individuals. The failure of a code to match its owner at a later time point was largely explained by the loss of specific microbial strains (at current limits of detection) and was only weakly associated with the length of the sampling interval. In addition to highlighting patterns of temporal variation in the ecology of the human microbiome, this work demonstrates the feasibility of microbiome-based identifiability, a result with important ethical implications for microbiome study design. The datasets and code used in this work are available for download from huttenhower.sph.harvard.edu/idability.
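The hitting-set idea behind a metagenomic code can be illustrated with a small greedy sketch. This is a simplified illustration under stated assumptions, not the authors' implementation: features are treated as simple presence/absence sets, the function names `greedy_code` and `matches` and the toy profiles are hypothetical, and the prioritization described in the abstract (favoring features that identify individuals stably) is replaced by a plain greedy rule that picks whichever feature rules out the most remaining individuals.

```python
from typing import Dict, List, Optional, Set


def greedy_code(owner: str, profiles: Dict[str, Set[str]]) -> Optional[List[str]]:
    """Greedy hitting-set sketch (hypothetical helper, presence/absence only).

    Returns a list of the owner's features such that every other individual
    lacks at least one of them, or None if no such code exists.
    """
    own = profiles[owner]
    # For each other individual, the owner's features that they do NOT carry.
    # A code must "hit" (include a feature from) every one of these sets.
    unhit = {other: own - feats for other, feats in profiles.items() if other != owner}
    if any(not diff for diff in unhit.values()):
        return None  # someone carries all of the owner's features
    code: List[str] = []
    while unhit:
        # Sorted candidates make tie-breaking deterministic (alphabetical).
        candidates = sorted(own - set(code))
        best = max(candidates, key=lambda f: sum(f in d for d in unhit.values()))
        code.append(best)
        # Individuals lacking the chosen feature are now excluded by the code.
        unhit = {o: d for o, d in unhit.items() if best not in d}
    return code


def matches(code: List[str], sample: Set[str]) -> bool:
    """A sample matches a code only if it carries every feature in the code."""
    return all(f in sample for f in code)


# Toy first-time-point profiles (assumed data, for illustration only).
profiles = {
    "A": {"s1", "s2", "s3"},
    "B": {"s1"},
    "C": {"s1", "s3"},
    "D": {"s3", "s4"},
}
code_a = greedy_code("A", profiles)
```

In this toy population, feature `s2` alone separates individual A from everyone else, while B has no unique code because A carries every feature B has. A follow-up sample from A that has lost `s2` no longer matches the code, which mirrors the abstract's observation that match failures are largely explained by the loss of specific strains.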