Helmholtz Zentrum München, German Research Center for Environmental Health, Institute of Computational Biology, Neuherberg, Germany.
Institute of Molecular Infection Biology, University of Würzburg, Würzburg, Germany.
Gigascience. 2018 Jun 1;7(6). doi: 10.1093/gigascience/giy070.
With the advent of the age of big data in bioinformatics, large volumes of data and high-performance computing power enable researchers to perform re-analyses of publicly available datasets at an unprecedented scale. Ever more studies implicate the microbiome in both normal human physiology and a wide range of diseases. RNA sequencing technology (RNA-seq) is commonly used to infer global eukaryotic gene expression patterns under defined conditions, including human disease-related contexts; however, its generic nature also enables the detection of microbial and viral transcripts.
We developed a bioinformatic pipeline to screen existing human RNA-seq datasets for the presence of microbial and viral reads by re-inspecting the non-human-mapping read fraction. We validated this approach by recapitulating outcomes from six independent, controlled infection experiments in cell line models and by comparing the results with those of an alternative metatranscriptomic mapping strategy. We then applied the pipeline to close to 150 terabytes of publicly available raw RNA-seq data from more than 17,000 samples from more than 400 studies relevant to human disease using state-of-the-art high-performance computing systems. The resulting data from this large-scale re-analysis are made available in the presented MetaMap resource.
Our results demonstrate that common human RNA-seq data, including those archived in public repositories, might contain valuable information to correlate microbial and viral detection patterns with diverse diseases. The presented MetaMap database thus provides a rich resource for generating hypotheses about the role of the microbiome in human disease. Additionally, code to process new datasets and perform statistical analyses is made available.
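The core screening idea, re-inspecting reads that do not map to the human reference, can be illustrated with a minimal Python sketch. This is not the MetaMap pipeline itself: the function names, the toy SAM records, the single-end layout, and the `assignments` lookup (standing in for whatever classifier assigns non-human reads to taxa) are all assumptions made for illustration. The only external fact relied on is that bit 0x4 of the SAM flag marks an unmapped read.

```python
from collections import Counter

UNMAPPED_FLAG = 0x4  # SAM flag bit: the read did not align to the reference


def unmapped_read_names(sam_lines):
    """Yield names of reads that did not map to the (human) reference.

    Assumes single-end SAM records, one per line (illustrative simplification).
    """
    for line in sam_lines:
        if line.startswith("@"):  # skip SAM header lines
            continue
        fields = line.rstrip("\n").split("\t")
        name, flag = fields[0], int(fields[1])
        if flag & UNMAPPED_FLAG:
            yield name


def taxon_counts(read_names, read_to_taxon):
    """Count non-human reads per taxon.

    `read_to_taxon` is a hypothetical lookup produced by some downstream
    classifier; reads without an assignment are counted as 'unclassified'.
    """
    return Counter(read_to_taxon.get(name, "unclassified") for name in read_names)


# Toy input: one header line, one human-mapped read, three unmapped reads.
sam = [
    "@HD\tVN:1.6",
    "r1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tFFFF",
    "r2\t4\t*\t0\t0\t*\t*\t0\t0\tTTGA\tFFFF",
    "r3\t4\t*\t0\t0\t*\t*\t0\t0\tGGCA\tFFFF",
    "r4\t4\t*\t0\t0\t*\t*\t0\t0\tCCAT\tFFFF",
]
assignments = {"r2": "taxon_A", "r3": "taxon_A"}  # hypothetical classifier output

non_human = list(unmapped_read_names(sam))
counts = taxon_counts(non_human, assignments)
print(counts)
```

In a real setting the per-taxon counts from many samples would be combined with study metadata before any correlation with disease is attempted; the sketch stops at the counting step.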