DataMed - an open source discovery index for finding biomedical datasets.

Authors

Chen Xiaoling, Gururaj Anupama E, Ozyurt Burak, Liu Ruiling, Soysal Ergin, Cohen Trevor, Tiryaki Firat, Li Yueling, Zong Nansu, Jiang Min, Rogith Deevakar, Salimi Mandana, Kim Hyeon-Eui, Rocca-Serra Philippe, Gonzalez-Beltran Alejandra, Farcas Claudiu, Johnson Todd, Margolis Ron, Alter George, Sansone Susanna-Assunta, Fore Ian M, Ohno-Machado Lucila, Grethe Jeffrey S, Xu Hua

Affiliations

School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX, USA.

Center for Research in Biological Systems.

Publication information

J Am Med Inform Assoc. 2018 Mar 1;25(3):300-308. doi: 10.1093/jamia/ocx121.

DOI:10.1093/jamia/ocx121

PMID:29346583

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC7378878/

Abstract

OBJECTIVE

Finding relevant datasets is important for promoting data reuse in the biomedical domain, but it is challenging given the volume and complexity of biomedical data. Here we describe the development of an open source biomedical data discovery system called DataMed, with the goal of promoting the building of additional data indexes in the biomedical domain.

MATERIALS AND METHODS

DataMed, which can efficiently index and search diverse types of biomedical datasets across repositories, is developed through the National Institutes of Health-funded biomedical and healthCAre Data Discovery Index Ecosystem (bioCADDIE) consortium. It consists of 2 main components: (1) a data ingestion pipeline that collects and transforms original metadata information to a unified metadata model, called DatA Tag Suite (DATS), and (2) a search engine that finds relevant datasets based on user-entered queries. In addition to describing its architecture and techniques, we evaluated individual components within DataMed, including the accuracy of the ingestion pipeline, the prevalence of the DATS model across repositories, and the overall performance of the dataset retrieval engine.
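The two components described above can be illustrated with a minimal sketch. It is not DataMed's implementation: the `UnifiedRecord` fields are a simplified stand-in rather than the actual DATS schema, the repository names and field mappings in `FIELD_MAPS` are hypothetical, and the term-counting `search` function merely stands in for a real retrieval engine.

```python
from dataclasses import dataclass


@dataclass
class UnifiedRecord:
    """Simplified stand-in for a unified metadata record (NOT the real DATS schema)."""
    identifier: str
    title: str
    description: str
    repository: str


# Hypothetical per-repository mappings: unified field name -> source field name.
FIELD_MAPS = {
    "repo_a": {"identifier": "acc", "title": "name", "description": "summary"},
    "repo_b": {"identifier": "id", "title": "dataset_title", "description": "abstract"},
}


def ingest(raw: dict, repository: str) -> UnifiedRecord:
    """Transform one repository-specific metadata record into the unified form."""
    mapping = FIELD_MAPS[repository]
    return UnifiedRecord(
        identifier=str(raw.get(mapping["identifier"], "")),
        title=str(raw.get(mapping["title"], "")),
        description=str(raw.get(mapping["description"], "")),
        repository=repository,
    )


def search(records: list, query: str) -> list:
    """Rank records by how often the query terms occur in title and description."""
    terms = query.lower().split()
    scored = []
    for rec in records:
        text = f"{rec.title} {rec.description}".lower()
        score = sum(text.count(term) for term in terms)
        if score > 0:
            scored.append((score, rec))
    # Sort on the score only; the sort is stable, so ties keep ingestion order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [rec for _, rec in scored]


if __name__ == "__main__":
    records = [
        ingest({"acc": "A1", "name": "Breast cancer RNA-seq",
                "summary": "Expression data from breast cancer samples"}, "repo_a"),
        ingest({"id": 7, "dataset_title": "Mouse brain imaging",
                "abstract": "MRI scans"}, "repo_b"),
    ]
    for hit in search(records, "breast cancer"):
        print(hit.repository, hit.identifier, hit.title)
```

Keeping the per-repository mapping in data rather than code means that, in this sketch, adding a repository only requires a new entry in `FIELD_MAPS`.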

RESULTS AND CONCLUSION

Our manual review shows that the ingestion pipeline could achieve an accuracy of 90% and that core elements of DATS varied in frequency across repositories. On a manually curated benchmark dataset, the DataMed search engine, which implements advanced natural language processing and terminology services, achieved an inferred average precision of 0.2033 and a precision at 10 (P@10, the proportion of relevant results among the top 10 search results) of 0.6022. We have made the DataMed system publicly available as an open source package for the biomedical community.
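As a brief illustration of the P@10 metric reported above, the sketch below computes precision at k for a ranked result list and averages it over several queries. The function names and the document identifiers are illustrative assumptions, not part of DataMed or its benchmark; inferred average precision is a separate, sampling-based estimate and is not implemented here.

```python
def precision_at_k(ranked_ids, relevant_ids, k=10):
    """Fraction of the top-k ranked results that are judged relevant.

    The denominator is always k, so a result list shorter than k is
    treated as having non-relevant results in the missing positions.
    """
    if k <= 0:
        raise ValueError("k must be a positive integer")
    top_k = ranked_ids[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant_ids)
    return hits / k


def mean_precision_at_k(runs, k=10):
    """Average precision at k over (ranked_ids, relevant_ids) pairs, one per query."""
    if not runs:
        raise ValueError("at least one query is required")
    return sum(precision_at_k(ranked, relevant, k) for ranked, relevant in runs) / len(runs)


if __name__ == "__main__":
    # Hypothetical ranking and relevance judgments for a single query.
    ranked = [f"d{i}" for i in range(1, 11)]
    relevant = {"d1", "d3", "d5", "d7", "d9", "d11"}
    print(precision_at_k(ranked, relevant))
```

Under this definition, a reported P@10 of 0.6022 corresponds to roughly six relevant datasets among the first ten results, averaged over the benchmark queries.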

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/9548/7378878/5814e88ff033/ocx121f1.jpg
