Wang Yanshan, Rastegar-Mojarad Majid, Komandur-Elayavilli Ravikumar, Liu Hongfang
Department of Health Sciences Research, Mayo Clinic, Rochester, MN 55901, USA.
Database (Oxford). 2017 Jan 1;2017. doi: 10.1093/database/bax091.
The recent movement toward open data in the biomedical domain has produced a large number of publicly accessible datasets. The Big Data to Knowledge data indexing project, the biomedical and healthCAre Data Discovery Index Ecosystem (bioCADDIE), has gathered these datasets in a one-stop portal to facilitate their reuse and accelerate scientific advances. However, as the number of stored and indexed biomedical datasets grows, retrieving the datasets relevant to a researcher's query becomes increasingly challenging. In this article, we propose an information retrieval (IR) system to tackle this problem and implement it for the bioCADDIE Dataset Retrieval Challenge. The system leverages the unstructured text of each dataset, including its title and description, and combines a state-of-the-art IR model, medical named entity extraction, query expansion with deep learning-based word embeddings and a re-ranking strategy to improve retrieval performance. In empirical experiments, we compared the proposed system with 11 baseline systems on the bioCADDIE Dataset Retrieval Challenge datasets. The results show that the proposed system outperforms the other systems in terms of inference Average Precision and inference normalized Discounted Cumulative Gain, which suggests that it is a viable option for biomedical dataset retrieval. Database URL: https://github.com/yanshanwang/biocaddie2016mayodata.
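To make the query expansion step in the abstract more concrete, the following is a minimal sketch of expanding a query with its nearest neighbours in a word embedding space. It is not the authors' implementation: the abstract does not specify the embedding model, the similarity measure, or the cut-offs, so cosine similarity, the `top_k` and `min_sim` parameters, the function names, and the toy three-dimensional vectors are all assumptions made for illustration.

```python
import math

# Hypothetical toy embeddings, invented for illustration only.
# A real system would load vectors trained on a large biomedical corpus.
TOY_EMBEDDINGS = {
    "cancer": [0.9, 0.1, 0.0],
    "tumor": [0.85, 0.15, 0.05],
    "carcinoma": [0.8, 0.2, 0.1],
    "gene": [0.1, 0.9, 0.0],
    "protein": [0.15, 0.8, 0.2],
    "mouse": [0.0, 0.1, 0.9],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def expand_query(terms, embeddings, top_k=2, min_sim=0.9):
    """Append up to top_k embedding neighbours (similarity >= min_sim) per query term.

    top_k and min_sim are assumed parameters, not values from the paper.
    """
    expanded = list(terms)
    for term in terms:
        if term not in embeddings:
            continue  # out-of-vocabulary terms are kept but not expanded
        # Score every vocabulary word that is not already in the query.
        scored = sorted(
            (
                (cosine(embeddings[term], vec), word)
                for word, vec in embeddings.items()
                if word not in expanded
            ),
            reverse=True,
        )
        for sim, word in scored[:top_k]:
            if sim >= min_sim:
                expanded.append(word)
    return expanded


if __name__ == "__main__":
    print(expand_query(["cancer", "gene"], TOY_EMBEDDINGS))
```

In this sketch the expanded term list would then be passed to the underlying IR model in place of the original query. A similarity threshold is used alongside `top_k` so that terms with no close neighbours do not pull in unrelated words.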