Predicting biomedical metadata in CEDAR: A study of Gene Expression Omnibus (GEO).

Authors

Panahiazar Maryam, Dumontier Michel, Gevaert Olivier

Affiliations

Stanford Center for Biomedical Informatics Research, Center for Data Annotation and Retrieval, Department of Medicine, Stanford University, Stanford, 94305, United States.

Publication information

J Biomed Inform. 2017 Aug;72:132-139. doi: 10.1016/j.jbi.2017.06.017. Epub 2017 Jun 16.

Abstract

A crucial and limiting factor in data reuse is the lack of accurate, structured, and complete descriptions of data, known as metadata. Towards improving the quantity and quality of metadata, we propose a novel metadata prediction framework that learns associations from existing metadata and uses them to predict metadata values. We evaluate our framework in the context of experimental metadata from the Gene Expression Omnibus (GEO). We applied four rule mining algorithms to the most common structured metadata elements (sample type, molecular type, platform, label type, and organism) from over 1.3 million GEO records. We examined the quality of well-supported rules from each algorithm and visualized the dependencies among metadata elements. Finally, we evaluated the performance of the algorithms in terms of accuracy, precision, recall, and F-measure. We found that PART is the best algorithm, outperforming Apriori, Predictive Apriori, and Decision Table. All algorithms perform significantly better in predicting class values than the majority vote classifier. We found that the performance of the algorithms is related to the dimensionality of the GEO elements. The average performance of all algorithms increases as the dimensionality (the number of unique values) of these elements decreases (2697 platforms, 537 organisms, 454 labels, 9 molecules, and 5 types). Our work suggests that experimental metadata such as that present in GEO can be accurately predicted using rule mining algorithms. Our work has implications for both prospective and retrospective augmentation of metadata quality, which are geared towards making data easier to find and reuse.
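The general idea of predicting one metadata element from the others, and comparing the result against a majority vote baseline, can be illustrated with a minimal Python sketch. This is not the authors' pipeline and does not implement PART, Apriori, Predictive Apriori, or Decision Table; it is a deliberately simplified single-antecedent rule miner. The records, the field values, the thresholds, and the function names (`mine_rules`, `predict`, `evaluate`) are all hypothetical and exist only for illustration. The scores are computed on the same records used for mining, which a real evaluation would avoid by holding out test data.

```python
from collections import Counter, defaultdict

# Hypothetical toy records. Field names echo the GEO elements named in the
# abstract, but every value here is invented for illustration.
RECORDS = [
    {"organism": "Homo sapiens", "molecule": "total RNA", "label": "biotin"},
    {"organism": "Homo sapiens", "molecule": "total RNA", "label": "biotin"},
    {"organism": "Homo sapiens", "molecule": "total RNA", "label": "Cy3"},
    {"organism": "Mus musculus", "molecule": "total RNA", "label": "Cy3"},
    {"organism": "Mus musculus", "molecule": "total RNA", "label": "Cy3"},
    {"organism": "Mus musculus", "molecule": "genomic DNA", "label": "Cy5"},
    {"organism": "Homo sapiens", "molecule": "genomic DNA", "label": "Cy5"},
    {"organism": "Homo sapiens", "molecule": "genomic DNA", "label": "Cy5"},
]


def mine_rules(records, target, min_support=2, min_confidence=0.5):
    """Mine rules of the form (field=value) -> target=value.

    Support is the number of records matching both sides of the rule;
    confidence is support divided by the number of records matching the
    antecedent. Both thresholds are illustrative assumptions.
    """
    antecedent_counts = Counter()
    target_counts = defaultdict(Counter)
    for rec in records:
        for field, value in rec.items():
            if field == target:
                continue
            antecedent_counts[(field, value)] += 1
            target_counts[(field, value)][rec[target]] += 1
    rules = {}
    for antecedent, targets in target_counts.items():
        predicted, support = targets.most_common(1)[0]
        confidence = support / antecedent_counts[antecedent]
        if support >= min_support and confidence >= min_confidence:
            rules[antecedent] = (predicted, support, confidence)
    return rules


def predict(rules, record, target, default):
    """Apply the highest-confidence matching rule, else fall back to default."""
    best = None
    for field, value in record.items():
        if field == target:
            continue
        rule = rules.get((field, value))
        if rule is not None and (best is None or rule[2] > best[2]):
            best = rule
    return best[0] if best is not None else default


def evaluate(true_values, predicted_values):
    """Return accuracy and macro-averaged precision, recall, and F-measure."""
    pairs = list(zip(true_values, predicted_values))
    classes = sorted(set(true_values))
    precisions, recalls, f_measures = [], [], []
    for cls in classes:
        tp = sum(t == cls and p == cls for t, p in pairs)
        n_pred = sum(p == cls for _, p in pairs)
        n_true = sum(t == cls for t, _ in pairs)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_true if n_true else 0.0
        total = precision + recall
        precisions.append(precision)
        recalls.append(recall)
        f_measures.append(2 * precision * recall / total if total else 0.0)
    n = len(classes)
    return {
        "accuracy": sum(t == p for t, p in pairs) / len(pairs),
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f_measure": sum(f_measures) / n,
    }


TARGET = "label"
truth = [rec[TARGET] for rec in RECORDS]
majority = Counter(truth).most_common(1)[0][0]  # majority vote baseline
rules = mine_rules(RECORDS, TARGET)

rule_scores = evaluate(truth, [predict(rules, r, TARGET, majority) for r in RECORDS])
baseline_scores = evaluate(truth, [majority] * len(RECORDS))
print("rule-based:", rule_scores)
print("majority vote:", baseline_scores)
```

On this toy data the rule-based predictor scores higher than the majority vote baseline, because an antecedent such as the molecule type narrows the plausible label values. The sketch also hints at why elements with fewer unique values are easier to predict: each candidate value is backed by more supporting records.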
