School of Biomedical Informatics, University of Texas Health Science Center at Houston (UTHealth), Houston, Texas, USA.
Universidad Antonio Nariño, Bogotá, Colombia.
Database (Oxford). 2019 Jan 1;2019:bay145. doi: 10.1093/database/bay145.
Gene Expression Omnibus (GEO) and other publicly available data repositories store their metadata as unstructured English text, which makes automated reuse very difficult.
We employed text mining techniques to analyze the metadata of GEO and developed the Restructured GEO database (ReGEO). ReGEO reorganizes and categorizes GEO series and makes them searchable by two new attributes extracted automatically from each series' metadata: the number of time points tested in the experiment and the disease being investigated. ReGEO also makes series searchable by other attributes already available in GEO, such as platform organism, experiment type, and associated PubMed ID, as well as by general keywords in the study's description. Our approach greatly expands the usability of GEO data and demonstrates a credible way to improve the utility of the vast amount of publicly available data in the era of Big Data research.
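As an illustration of how a "number of time points" attribute might be derived from free-text metadata, the following is a minimal sketch that assumes a simple regular-expression approach. The abstract does not describe the actual ReGEO extraction method, so the pattern, the unit table, and the function name `extract_time_points` are all hypothetical.

```python
import re

# Hypothetical pattern (not from ReGEO): a number followed by a time unit,
# e.g. "6 h", "24 hours", "3 days", "2 wks".
TIME_POINT = re.compile(
    r"\b(\d+(?:\.\d+)?)\s*-?\s*"
    r"(min(?:ute)?s?|h(?:(?:ou)?rs?)?|d(?:ays?)?|w(?:ee)?ks?|months?)\b",
    re.IGNORECASE,
)

# Assumed conversion factors to hours (a month is approximated as 30 days).
HOURS_PER_UNIT = {"min": 1 / 60, "h": 1, "d": 24, "w": 168, "mo": 720}


def normalize_unit(unit):
    """Map a matched unit string such as 'hrs' or 'Days' to a key of HOURS_PER_UNIT."""
    unit = unit.lower()
    if unit.startswith("mo"):
        return "mo"
    if unit.startswith("mi"):
        return "min"
    return unit[0]  # 'h', 'd' or 'w'


def extract_time_points(text):
    """Return the sorted distinct time points (in hours) mentioned in text."""
    points = set()
    for value, unit in TIME_POINT.findall(text):
        hours = float(value) * HOURS_PER_UNIT[normalize_unit(unit)]
        points.add(round(hours, 4))
    return sorted(points)


# Made-up description text, used only to exercise the sketch.
summary = (
    "Cells were harvested at 0 h, 6 h, 24 hours and 3 days after treatment; "
    "6 h samples were run in duplicate."
)
time_points = extract_time_points(summary)
print(len(time_points), time_points)
```

Repeated mentions of the same time point are collapsed, so the count reflects distinct time points rather than raw matches. A production pipeline would likely need more than this, for example handling ranges such as "0-48 h" and durations that are not sampling times.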