Dipartimento di Elettronica, Informazione e Bioingegneria, Politecnico di Milano, Piazza Leonardo da Vinci 32, 20133 Milan, Italy.
Database (Oxford). 2019 Jan 1;2019. doi: 10.1093/database/baz132.
Many valuable resources developed by research institutions and consortia worldwide describe genomic datasets that are open and available for secondary research, but their metadata search interfaces are heterogeneous, not interoperable and sometimes very limited in capability. We implemented GenoSurf, a multi-ontology semantic search system that provides access to a consolidated collection of metadata attributes found in the most relevant genomic datasets; the values of 10 attributes are semantically enriched using the most suitable available ontologies. The user of GenoSurf supplies search terms, sets the desired level of ontological enrichment and obtains the identities of the matching data files at the various sources. Search is facilitated by drop-down lists of matching values; aggregate counts describing the resulting files are updated in real time as search terms are progressively added. In addition to the consolidated attributes, users can perform keyword-based searches on the original (raw) metadata, which are also imported; GenoSurf supports the interplay of attribute-based and keyword-based search through well-defined interfaces. Currently, GenoSurf integrates about 40 million metadata items from several major data sources, including three providers of clinical and experimental data (TCGA, ENCODE and Roadmap Epigenomics) and two sources of annotation data (GENCODE and RefSeq); it can be used as a standalone resource for locating genomic datasets at their original sources (identified by their accession IDs and URLs), or as part of an integrated query answering system for performing complex queries over genomic regions and metadata.
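The search flow described above (a search term, a chosen level of ontological enrichment, matching files and aggregate counts per source) can be illustrated with a minimal Python sketch. It is not GenoSurf's implementation or API: the level names ("original", "synonym", "expanded"), the toy ontology, the synonym table, the file records and the functions `expand` and `search` are all hypothetical, chosen only to show how widening the set of accepted values changes which files match.

```python
from collections import Counter

# Hypothetical toy ontology: term -> direct children (more specific terms).
CHILDREN = {
    "cancer": ["carcinoma", "leukemia"],
    "carcinoma": ["adenocarcinoma"],
}

# Hypothetical synonym table: term -> equivalent labels.
SYNONYMS = {
    "cancer": {"malignant neoplasm"},
}

# Hypothetical file records: (accession ID, source, disease value).
FILES = [
    ("F1", "TCGA", "adenocarcinoma"),
    ("F2", "TCGA", "cancer"),
    ("F3", "ENCODE", "malignant neoplasm"),
    ("F4", "ENCODE", "leukemia"),
    ("F5", "Roadmap Epigenomics", "healthy"),
]


def expand(term, level):
    """Return the set of values accepted for `term` at an enrichment level.

    Assumed levels (illustrative only):
      "original" - the term itself
      "synonym"  - the term plus its synonyms
      "expanded" - the above plus all descendants of the term in the ontology
    """
    accepted = {term}
    if level in ("synonym", "expanded"):
        accepted |= SYNONYMS.get(term, set())
    if level == "expanded":
        stack = [term]
        while stack:
            for child in CHILDREN.get(stack.pop(), []):
                if child not in accepted:
                    accepted.add(child)
                    stack.append(child)
    return accepted


def search(term, level="original"):
    """Return matching file records and the number of matches per source."""
    accepted = expand(term, level)
    hits = [record for record in FILES if record[2] in accepted]
    counts = Counter(source for _, source, _ in hits)
    return hits, counts


if __name__ == "__main__":
    for level in ("original", "synonym", "expanded"):
        hits, counts = search("cancer", level)
        print(level, [accession for accession, _, _ in hits], dict(counts))
```

In this toy setting, each step up in enrichment level accepts a superset of the previous level's values, so the set of matching files can only grow. The per-source counter stands in for the aggregate counts that the interface updates as terms are added.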