GenoSurf: metadata driven semantic search system for integrated genomic datasets.

Affiliation

Dipartimento di Elettronica, Informazione e Bioingegneria, Politecnico di Milano, Piazza Leonardo da Vinci 32, 20133 Milan, Italy.

Publication information

Database (Oxford). 2019 Jan 1;2019. doi: 10.1093/database/baz132.

DOI:10.1093/database/baz132

PMID:31820804

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC6902006/

Abstract

Many valuable resources developed by worldwide research institutions and consortia describe genomic datasets that are both open and available for secondary research, but their metadata search interfaces are heterogeneous, not interoperable and sometimes offer very limited capabilities. We implemented GenoSurf, a multi-ontology semantic search system providing access to a consolidated collection of metadata attributes found in the most relevant genomic datasets; values of 10 attributes are semantically enriched by making use of the most suitable available ontologies. The user of GenoSurf provides search terms as input, sets the desired level of ontological enrichment and obtains as output the identity of matching data files at the various sources. Search is facilitated by drop-down lists of matching values; aggregate counts describing the resulting files are updated in real time as search terms are progressively added. In addition to the consolidated attributes, users can perform keyword-based searches on the original (raw) metadata, which are also imported; GenoSurf supports the interplay of attribute-based and keyword-based search through well-defined interfaces. Currently, GenoSurf integrates about 40 million metadata items from several major data sources, including three providers of clinical and experimental data (TCGA, ENCODE and Roadmap Epigenomics) and two sources of annotation data (GENCODE and RefSeq); it can be used as a standalone resource for locating genomic datasets at their original sources (identified by their accession IDs and URLs), or as part of an integrated query answering system for performing complex queries over genomic regions and metadata.
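
The search flow described in the abstract (search terms in, a chosen level of ontological enrichment, matching files and aggregate counts out) can be illustrated with a minimal sketch. Everything in it is hypothetical: the toy ontology, the file records, the example URLs, the function names and the level names `original`, `synonym` and `expanded` are assumptions chosen for illustration, not GenoSurf's actual data model or API.

```python
from collections import Counter

# Hypothetical toy ontology (not taken from GenoSurf).
SYNONYMS = {"brain": {"encephalon"}}      # term -> synonyms
BROADER = {"cerebellum": {"brain"}}       # term -> broader (parent) terms

# Hypothetical consolidated metadata, one record per data file.
FILES = [
    {"id": "F1", "source": "ENCODE", "tissue": "brain", "url": "https://example.org/F1"},
    {"id": "F2", "source": "Roadmap Epigenomics", "tissue": "encephalon", "url": "https://example.org/F2"},
    {"id": "F3", "source": "TCGA", "tissue": "cerebellum", "url": "https://example.org/F3"},
    {"id": "F4", "source": "ENCODE", "tissue": "liver", "url": "https://example.org/F4"},
]


def expand(term, level):
    """Return the set of values accepted for `term` at the given enrichment level."""
    term = term.lower()
    matches = {term}
    if level in ("synonym", "expanded"):
        # Add synonyms in both directions.
        matches |= SYNONYMS.get(term, set())
        matches |= {t for t, syns in SYNONYMS.items() if term in syns}
    if level == "expanded":
        # Add all narrower terms by walking the hierarchy downwards.
        frontier = set(matches)
        while frontier:
            children = {t for t, parents in BROADER.items() if parents & frontier}
            children -= matches
            matches |= children
            frontier = children
    return matches


def search(filters, level="original"):
    """Return files whose attributes match every filter at the chosen level."""
    accepted = {attr: expand(term, level) for attr, term in filters.items()}
    return [
        f for f in FILES
        if all(f.get(attr, "").lower() in values for attr, values in accepted.items())
    ]


def counts_by_source(results):
    """Aggregate count of matching files per data source."""
    return Counter(f["source"] for f in results)


if __name__ == "__main__":
    for level in ("original", "synonym", "expanded"):
        hits = search({"tissue": "brain"}, level)
        print(level, [f["id"] for f in hits], dict(counts_by_source(hits)))
```

In this sketch, raising the enrichment level only widens the set of accepted attribute values, so the result set can grow but never shrink; the per-source counts are recomputed from the current result set each time the filters change.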

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f6b0/6902006/b396b4db40af/baz132f1.jpg

Similar articles

GenoSurf: metadata driven semantic search system for integrated genomic datasets.

Database (Oxford). 2019 Jan 1;2019. doi: 10.1093/database/baz132.

Ontology-Based Search of Genomic Metadata.

IEEE/ACM Trans Comput Biol Bioinform. 2016 Mar-Apr;13(2):233-47. doi: 10.1109/TCBB.2015.2495179. Epub 2015 Oct 26.

linkedISA: semantic representation of ISA-Tab experimental metadata.

BMC Bioinformatics. 2014;15 Suppl 14(Suppl 14):S4. doi: 10.1186/1471-2105-15-S14-S4. Epub 2014 Nov 27.

Scaling the walls of discovery: using semantic metadata for integrative problem solving.

Brief Bioinform. 2009 Mar;10(2):164-76. doi: 10.1093/bib/bbp007.

A semantic proteomics dashboard (SemPoD) for data management in translational research.

BMC Syst Biol. 2012;6 Suppl 3(Suppl 3):S20. doi: 10.1186/1752-0509-6-S3-S20. Epub 2012 Dec 17.

SATORI: a system for ontology-guided visual exploration of biomedical data repositories.

Bioinformatics. 2018 Apr 1;34(7):1200-1207. doi: 10.1093/bioinformatics/btx739.

"METAGENOTE: a simplified web platform for metadata annotation of genomic samples and streamlined submission to NCBI's sequence read archive".

BMC Bioinformatics. 2020 Sep 3;21(1):378. doi: 10.1186/s12859-020-03694-0.

Towards a semantic medical Web: HealthCyberMap's tool for building an RDF metadata base of health information resources based on the Qualified Dublin Core Metadata Set.

Med Sci Monit. 2002 Jul;8(7):MT124-36.

An Annotation Workbench for Semantic Annotation of Data Collection Instruments.

Stud Health Technol Inform. 2023 May 18;302:108-112. doi: 10.3233/SHTI230074.

ODMedit: uniform semantic annotation for data integration in medicine based on a public metadata repository.

BMC Med Res Methodol. 2016 Jun 1;16:65. doi: 10.1186/s12874-016-0164-9.

Cited by

PEPhub: a database, web interface, and API for editing, sharing, and validating biological sample metadata.

Gigascience. 2024 Jan 2;13. doi: 10.1093/gigascience/giae033.

Joint Representation Learning for Retrieval and Annotation of Genomic Interval Sets.

Bioengineering (Basel). 2024 Mar 8;11(3):263. doi: 10.3390/bioengineering11030263.

Ontologies for increasing the FAIRness of plant research data.

Front Plant Sci. 2023 Nov 30;14:1279694. doi: 10.3389/fpls.2023.1279694. eCollection 2023.

PEPhub: a database, web interface, and API for editing, sharing, and validating biological sample metadata.

bioRxiv. 2024 May 11:2023.08.15.551388. doi: 10.1101/2023.08.15.551388.

Challenges to sharing sample metadata in computational genomics.

Front Genet. 2023 May 23;14:1154198. doi: 10.3389/fgene.2023.1154198. eCollection 2023.

Processing genome-wide association studies within a repository of heterogeneous genomic datasets.

BMC Genom Data. 2023 Mar 3;24(1):13. doi: 10.1186/s12863-023-01111-y.

Genomic data integration and user-defined sample-set extraction for population variant analysis.

BMC Bioinformatics. 2022 Sep 29;23(1):401. doi: 10.1186/s12859-022-04927-0.

High Performance Integration Pipeline for Viral and Epitope Sequences.

BioTech (Basel). 2022 Mar 21;11(1):7. doi: 10.3390/biotech11010007.

GeMI: interactive interface for transformer-based Genomic Metadata Integration.

Database (Oxford). 2022 Jun 3;2022. doi: 10.1093/database/baac036.

Dug: a semantic search engine leveraging peer-reviewed knowledge to query biomedical data repositories.

Bioinformatics. 2022 Jun 13;38(12):3252-3258. doi: 10.1093/bioinformatics/btac284.
