Predicting biomedical metadata in CEDAR: A study of Gene Expression Omnibus (GEO).

Authors

Panahiazar Maryam, Dumontier Michel, Gevaert Olivier

Affiliations

Stanford Center for Biomedical Informatics Research, Center for Data Annotation and Retrieval, Department of Medicine, Stanford University, Stanford, 94305, United States.

Publication information

J Biomed Inform. 2017 Aug;72:132-139. doi: 10.1016/j.jbi.2017.06.017. Epub 2017 Jun 16.

DOI:10.1016/j.jbi.2017.06.017
PMID:28625880
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC5643580/
Abstract

A crucial and limiting factor in data reuse is the lack of accurate, structured, and complete descriptions of data, known as metadata. To improve the quantity and quality of metadata, we propose a novel metadata prediction framework that learns associations from existing metadata and uses them to predict metadata values. We evaluate our framework on experimental metadata from the Gene Expression Omnibus (GEO). We applied four rule mining algorithms to the most common structured metadata elements (sample type, molecular type, platform, label type, and organism) from over 1.3 million GEO records. We examined the quality of well-supported rules from each algorithm and visualized the dependencies among metadata elements. Finally, we evaluated the performance of the algorithms in terms of accuracy, precision, recall, and F-measure. We found that PART is the best algorithm, outperforming Apriori, Predictive Apriori, and Decision Table. All algorithms perform significantly better in predicting class values than the majority vote classifier. We found that the performance of the algorithms is related to the dimensionality of the GEO elements. The average performance of all algorithms increases as the number of unique values of an element decreases (2697 platforms, 537 organisms, 454 labels, 9 molecules, and 5 types). Our work suggests that experimental metadata such as that present in GEO can be accurately predicted using rule mining algorithms. Our work has implications for both prospective and retrospective augmentation of metadata quality, which are geared towards making data easier to find and reuse.
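The idea of learning associations from existing metadata and comparing them against a majority vote baseline can be illustrated with a minimal sketch. It is not one of the four algorithms named in the abstract (PART, Apriori, Predictive Apriori, Decision Table); it is a simplified single-antecedent rule miner with support and confidence thresholds. The records, field names, thresholds, and function names are all hypothetical and chosen only for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical toy records; the values are illustrative, not taken from GEO.
TRAIN = [
    {"organism": "Homo sapiens", "molecule": "total RNA", "label": "biotin"},
    {"organism": "Homo sapiens", "molecule": "total RNA", "label": "biotin"},
    {"organism": "Mus musculus", "molecule": "total RNA", "label": "biotin"},
    {"organism": "Homo sapiens", "molecule": "genomic DNA", "label": "Cy5"},
    {"organism": "Mus musculus", "molecule": "genomic DNA", "label": "Cy5"},
    {"organism": "Homo sapiens", "molecule": "total RNA", "label": "biotin"},
]
TEST = [
    {"organism": "Mus musculus", "molecule": "genomic DNA", "label": "Cy5"},
    {"organism": "Homo sapiens", "molecule": "total RNA", "label": "biotin"},
    {"organism": "Homo sapiens", "molecule": "genomic DNA", "label": "Cy5"},
]


def mine_rules(records, target, min_support=2, min_confidence=0.8):
    """Mine rules of the form (field, value) -> target value.

    Support is the number of records matching both sides of the rule;
    confidence is support divided by the number of records matching
    the antecedent.
    """
    counts = defaultdict(Counter)
    for rec in records:
        for field, value in rec.items():
            if field != target:
                counts[(field, value)][rec[target]] += 1
    rules = {}
    for antecedent, targets in counts.items():
        best_value, support = targets.most_common(1)[0]
        confidence = support / sum(targets.values())
        if support >= min_support and confidence >= min_confidence:
            rules[antecedent] = (best_value, confidence)
    return rules


def majority_value(records, target):
    """Majority vote baseline: always predict the most frequent value."""
    return Counter(rec[target] for rec in records).most_common(1)[0][0]


def predict(record, rules, fallback):
    """Apply the highest-confidence matching rule, else the fallback."""
    best = None
    for field, value in record.items():
        rule = rules.get((field, value))
        if rule is not None and (best is None or rule[1] > best[1]):
            best = rule
    return best[0] if best is not None else fallback


def accuracy(gold, predicted):
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)


def precision_recall_f1(gold, predicted, positive):
    """Per-class precision, recall, and F-measure for one target value."""
    tp = sum(g == positive and p == positive for g, p in zip(gold, predicted))
    fp = sum(g != positive and p == positive for g, p in zip(gold, predicted))
    fn = sum(g == positive and p != positive for g, p in zip(gold, predicted))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


TARGET = "label"
rules = mine_rules(TRAIN, TARGET)
baseline = majority_value(TRAIN, TARGET)
gold = [rec[TARGET] for rec in TEST]
# Hide the target field before predicting, as if it were missing.
inputs = [{k: v for k, v in rec.items() if k != TARGET} for rec in TEST]
rule_preds = [predict(rec, rules, baseline) for rec in inputs]
base_preds = [baseline for _ in inputs]

if __name__ == "__main__":
    print("rules:", rules)
    print("rule accuracy:", accuracy(gold, rule_preds))
    print("baseline accuracy:", accuracy(gold, base_preds))
    print("baseline P/R/F for biotin:", precision_recall_f1(gold, base_preds, "biotin"))
```

On this toy data only the `molecule` field yields rules that pass both thresholds, which mirrors the abstract's point that some metadata elements depend strongly on others. A real evaluation would use held-out GEO records and report the metrics per element.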

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/6462/5643580/c65c757c6cdc/nihms894334f1.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/6462/5643580/441b45c1c38a/nihms894334f2.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/6462/5643580/9bad539f715d/nihms894334f3.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/6462/5643580/cadfed389134/nihms894334f4.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/6462/5643580/061d932fac5e/nihms894334f5.jpg
