DataMed - an open source discovery index for finding biomedical datasets.

Author information

Chen Xiaoling, Gururaj Anupama E, Ozyurt Burak, Liu Ruiling, Soysal Ergin, Cohen Trevor, Tiryaki Firat, Li Yueling, Zong Nansu, Jiang Min, Rogith Deevakar, Salimi Mandana, Kim Hyeon-Eui, Rocca-Serra Philippe, Gonzalez-Beltran Alejandra, Farcas Claudiu, Johnson Todd, Margolis Ron, Alter George, Sansone Susanna-Assunta, Fore Ian M, Ohno-Machado Lucila, Grethe Jeffrey S, Xu Hua

Affiliations

School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX, USA.

Center for Research in Biological Systems.

Publication information

J Am Med Inform Assoc. 2018 Mar 1;25(3):300-308. doi: 10.1093/jamia/ocx121.

Abstract

OBJECTIVE

Finding relevant datasets is important for promoting data reuse in the biomedical domain, but it is challenging given the volume and complexity of biomedical data. Here we describe the development of an open source biomedical data discovery system called DataMed, with the goal of promoting the building of additional data indexes in the biomedical domain.

MATERIALS AND METHODS

DataMed, which can efficiently index and search diverse types of biomedical datasets across repositories, is developed through the National Institutes of Health-funded biomedical and healthCAre Data Discovery Index Ecosystem (bioCADDIE) consortium. It consists of 2 main components: (1) a data ingestion pipeline that collects and transforms original metadata information to a unified metadata model, called DatA Tag Suite (DATS), and (2) a search engine that finds relevant datasets based on user-entered queries. In addition to describing its architecture and techniques, we evaluated individual components within DataMed, including the accuracy of the ingestion pipeline, the prevalence of the DATS model across repositories, and the overall performance of the dataset retrieval engine.
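The two-component design described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the `UnifiedRecord` class is a simplified stand-in and not the actual DATS schema, the repository names and field mappings are hypothetical, and the term-count search is a toy replacement for the real retrieval engine.

```python
from dataclasses import dataclass, field


@dataclass
class UnifiedRecord:
    """Hypothetical, simplified unified metadata record (not the real DATS model)."""
    identifier: str
    title: str
    description: str = ""
    keywords: list[str] = field(default_factory=list)
    source: str = ""


# Hypothetical per-repository mappings: unified field name -> source field name.
FIELD_MAPS = {
    "repo_a": {"identifier": "accession", "title": "name",
               "description": "summary", "keywords": "tags"},
    "repo_b": {"identifier": "id", "title": "dataset_title",
               "description": "abstract", "keywords": "subjects"},
}


def ingest(raw: dict, repo: str) -> UnifiedRecord:
    """Component 1: transform repository-specific metadata into the unified record."""
    mapping = FIELD_MAPS[repo]
    return UnifiedRecord(
        identifier=str(raw[mapping["identifier"]]),
        title=raw[mapping["title"]].strip(),
        description=raw.get(mapping["description"], "").strip(),
        keywords=[k.strip().lower() for k in raw.get(mapping["keywords"], [])],
        source=repo,
    )


def search(records: list[UnifiedRecord], query: str) -> list[UnifiedRecord]:
    """Component 2 (toy version): rank records by how often query terms occur."""
    terms = query.lower().split()
    scored = []
    for rec in records:
        text = " ".join([rec.title, rec.description, *rec.keywords]).lower()
        score = sum(text.count(term) for term in terms)
        if score > 0:
            scored.append((score, rec))
    # Sort on the score only, so records themselves are never compared.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [rec for _, rec in scored]
```

Keeping the per-repository mappings in data rather than in code means that, in a design like this one, adding a repository only requires a new mapping entry; the search side sees the same record shape regardless of where the metadata came from.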

RESULTS AND CONCLUSION

Our manual review shows that the ingestion pipeline could achieve an accuracy of 90% and that the core elements of DATS varied in frequency across repositories. On a manually curated benchmark dataset, the DataMed search engine, which implements advanced natural language processing and terminology services, achieved an inferred average precision of 0.2033 and a precision at 10 (P@10, the fraction of relevant results among the top 10 search results) of 0.6022. We have made the DataMed system publicly available as an open source package for the biomedical community.
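As a rough illustration of the P@10 metric reported above, the sketch below computes precision at k for one ranked result list and averages it over several queries. The function names and the sample identifiers are hypothetical, and the code follows the common convention of dividing by k even when fewer than k results are returned. It does not implement inferred average precision, which is a separate estimate designed for benchmarks with incomplete relevance judgments.

```python
def precision_at_k(ranked_ids, relevant_ids, k=10):
    """Fraction of the top-k ranked results that are judged relevant."""
    if k <= 0:
        raise ValueError("k must be a positive integer")
    top_k = ranked_ids[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant_ids)
    # Divide by k (not len(top_k)) so that short result lists are penalized.
    return hits / k


def mean_precision_at_k(runs, k=10):
    """Average precision at k over (ranked_ids, relevant_ids) pairs, one per query."""
    if not runs:
        raise ValueError("at least one query is required")
    return sum(precision_at_k(ranked, relevant, k) for ranked, relevant in runs) / len(runs)


# Hypothetical example: 5 of the top 10 results are relevant.
ranked = ["d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10"]
relevant = {"d1", "d3", "d5", "d7", "d9", "d11"}
p10 = precision_at_k(ranked, relevant)  # 5 hits / 10 = 0.5
```

Read this way, a mean P@10 of 0.6022 corresponds to roughly six relevant datasets among the first ten results per query on average.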

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/9548/7378878/5814e88ff033/ocx121f1.jpg
