Li Zhongyou, Koeppen Katja, Holden Victoria I, Neff Samuel L, Cengher Liviu, Demers Elora G, Mould Dallas L, Stanton Bruce A, Hampton Thomas H
Department of Microbiology and Immunology, Geisel School of Medicine at Dartmouth, Hanover, New Hampshire, USA.
mSystems. 2021 Mar 23;6(2):e01305-20. doi: 10.1128/mSystems.01305-20.
The NCBI Gene Expression Omnibus (GEO) provides tools to query and download transcriptomic data. However, less than 4% of microbial experiments include the sample group annotations required to assess differential gene expression for high-throughput reanalysis, and data deposited after 2014 universally lack these annotations. Our algorithm GAUGE (general annotation using text/data group ensembles) automatically annotates GEO microbial data sets, including microarray and RNA sequencing studies, increasing the percentage of data sets amenable to analysis from 4% to 33%. Eighty-nine percent of GAUGE-annotated studies matched group assignments generated by human curators. To demonstrate how GAUGE annotation can lead to scientific insight, we created GAPE (GAUGE-annotated and transcriptomic compendia for reanalysis), a Shiny Web interface to analyze 73 GAUGE-annotated studies, three times more than previously available. GAPE analysis revealed that , a gene of unknown function, was frequently differentially expressed in more than 50% of studies and significantly coregulated with genes involved in biofilm formation. Follow-up wet-bench experiments demonstrate that mutants are indeed defective in biofilm formation, consistent with predictions facilitated by GAUGE and GAPE. We anticipate that GAUGE and GAPE, which we have made freely available, will make publicly available microbial transcriptomic data easier to reuse and lead to new data-driven hypotheses.
GEO archives transcriptomic data from over 5,800 microbial experiments and allows researchers to answer questions not directly addressed in published papers. However, less than 4% of the microbial data sets include the sample group annotations required for high-throughput reanalysis. This limitation blocks a considerable amount of microbial transcriptomic data from being reused easily. Here, we demonstrate that the GAUGE algorithm could make 33% of microbial data accessible to parallel mining and reanalysis. GAUGE annotations increase statistical power and, thereby, make consistent patterns of differential gene expression easier to identify. In addition, we developed GAPE (GAUGE-annotated and transcriptomic compendia for reanalysis), a Shiny Web interface that performs parallel analyses on and compendia. Source code for GAUGE and GAPE is freely available and can be repurposed to create compendia for other bacterial species.
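As a rough illustration of the kind of cross-study tally described above (a gene being differentially expressed in more than 50% of studies), the following minimal sketch counts, for each gene, the fraction of studies in which it was called differentially expressed. It is not the GAUGE or GAPE implementation: the function names, the input layout (a mapping from study identifier to a set of gene identifiers), and the toy study and gene names are all assumptions made for this example.

```python
from collections import Counter


def de_frequency(studies):
    """Return the fraction of studies in which each gene is called
    differentially expressed (DE).

    `studies` is a hypothetical layout: study id -> set of gene ids
    called DE in that study.
    """
    counts = Counter()
    for genes in studies.values():
        # set() guards against a gene being listed twice within one study
        counts.update(set(genes))
    n_studies = len(studies)
    return {gene: count / n_studies for gene, count in counts.items()}


def frequent_genes(studies, threshold=0.5):
    """Genes that are DE in strictly more than `threshold` of the studies."""
    freq = de_frequency(studies)
    return sorted(gene for gene, f in freq.items() if f > threshold)


# Toy, made-up input: study and gene identifiers are placeholders.
toy = {
    "study_A": {"gene1", "gene2"},
    "study_B": {"gene1"},
    "study_C": {"gene1", "gene3"},
    "study_D": {"gene2", "gene3"},
}

print(frequent_genes(toy))  # ['gene1']
```

Here gene1 is differentially expressed in 3 of 4 toy studies and passes the threshold, while gene2 and gene3 sit at exactly 50% and do not. A tally like this only becomes possible once each study has sample group annotations, since those are what allow differential expression to be called in the first place.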