Department of Biostatistics and Data Science, School of Public Health, The University of Texas Health Science Center at Houston, 1200 Pressler Street, Suite E-833, Houston, TX 77030, USA.
School of Biomedical Informatics, The University of Texas Health Science Center at Houston, 7000 Fannin St., Suite 600, Houston, TX 77030, USA.
Database (Oxford). 2020 Jan 1;2020:1. doi: 10.1093/database/baaa064.
Researchers increasingly make their data publicly available to support experimental reproducibility and data reuse. Sharing data with fellow researchers also increases the visibility of the work. At the same time, other researchers are held back by a lack of data resources. To address this, many repositories and knowledge bases have been established to ease data sharing, and over the past two decades the number of datasets added to these repositories has grown exponentially. However, most of these repositories are domain-specific, and none of them can recommend datasets to researchers/users. It is therefore challenging for a researcher to keep track of all the repositories that may be relevant. A dataset recommender system that recommends datasets to a researcher based on their previous publications can enhance productivity and expedite further research. This work adopts an information retrieval (IR) paradigm for dataset recommendation. We hypothesize that, beyond the corpus, two fundamental differences exist between dataset recommendation and PubMed-style biomedical IR. First, instead of keywords, the query is the researcher, embodied by his or her publications. Second, to separate relevant datasets from non-relevant ones, a researcher is better represented by a set of interests than by the entire body of their research. This second idea is implemented using a non-parametric clustering technique that groups a researcher's publications into clusters. Datasets are then recommended to each researcher using the cosine similarity between the vector representations of publication clusters and datasets. In a manual evaluation by five researchers, the proposed method achieved a maximum normalized discounted cumulative gain at 10 (NDCG@10) of 0.89, a partial precision at 10 (p@10) of 0.78 and a strict p@10 of 0.61.
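The cluster-then-match idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not name the non-parametric clustering technique, so a simple greedy threshold ("leader") clustering is swapped in here only because it also does not fix the number of clusters in advance. The function names, the `threshold` parameter, the toy vectors, and the choice to score each dataset by its best-matching cluster are all assumptions made for illustration.

```python
import numpy as np

def unit(v):
    """Scale vectors (rows) to unit length so dot products equal cosine similarity."""
    v = np.asarray(v, dtype=float)
    norms = np.linalg.norm(v, axis=-1, keepdims=True)
    return v / np.where(norms == 0, 1.0, norms)

def cluster_publications(pub_vecs, threshold=0.8):
    """Greedy leader clustering (a stand-in, not the paper's method).

    A publication joins the most similar existing cluster if the cosine
    similarity to its centroid is at least `threshold`; otherwise it starts
    a new cluster. The number of clusters is therefore not fixed up front.
    """
    pubs = unit(pub_vecs)
    members = []  # row indices of the publications in each cluster
    sums = []     # running sum of unit vectors for each cluster
    for i, p in enumerate(pubs):
        if sums:
            sims = unit(np.array(sums)) @ p
            best = int(np.argmax(sims))
            if sims[best] >= threshold:
                members[best].append(i)
                sums[best] = sums[best] + p
                continue
        members.append([i])
        sums.append(p.copy())
    return unit(np.array(sums)), members

def recommend(pub_vecs, dataset_vecs, k=10, threshold=0.8):
    """Rank datasets by cosine similarity to the researcher's interest clusters."""
    centroids, _ = cluster_publications(pub_vecs, threshold)
    sims = centroids @ unit(dataset_vecs).T  # shape: (clusters, datasets)
    best = sims.max(axis=0)                  # assumption: best-matching interest wins
    order = np.argsort(-best)[:k]
    return [(int(j), float(best[j])) for j in order]

if __name__ == "__main__":
    # Toy vectors: two publications on one topic, one on another.
    pubs = np.array([[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 0.0, 1.0]])
    datasets = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
    for dataset_id, score in recommend(pubs, datasets, k=2):
        print(dataset_id, round(score, 3))
```

Representing the researcher by several cluster centroids rather than one averaged vector keeps a minority interest (the third publication in the toy data) from being drowned out by the dominant one, which is the motivation the abstract gives for the set-of-interests view.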
To the best of our knowledge, this is the first study of its kind on content-based dataset recommendation. We hope that this system will further promote data sharing, reduce researchers' workload in identifying the right dataset and increase the reusability of biomedical datasets. Database URL: http://genestudy.org/recommends/#/.
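The evaluation metrics reported in the abstract (NDCG@10 and p@10) can be computed from a ranked list of manual relevance ratings, as in the sketch below. The rating scale is hypothetical (0 = not relevant, 1 = partially relevant, 2 = relevant), as is the reading of "partial" as counting ratings of 1 or higher and "strict" as counting only the top rating; the abstract does not define either. The ideal ordering for NDCG is taken from the same rated list, which is also an assumption.

```python
import math

def dcg_at_k(ratings, k=10):
    """Discounted cumulative gain: each rating is discounted by log2(rank + 1)."""
    return sum(r / math.log2(i + 2) for i, r in enumerate(ratings[:k]))

def ndcg_at_k(ratings, k=10):
    """DCG normalized by the DCG of the same ratings sorted best-first."""
    ideal = dcg_at_k(sorted(ratings, reverse=True), k)
    return dcg_at_k(ratings, k) / ideal if ideal > 0 else 0.0

def precision_at_k(ratings, k=10, min_rating=1):
    """Fraction of the top-k results rated at least `min_rating`."""
    return sum(r >= min_rating for r in ratings[:k]) / k

if __name__ == "__main__":
    # Hypothetical ratings for one researcher's top-10 recommendations.
    ratings = [2, 1, 0, 2, 0, 1, 0, 0, 0, 0]
    print("NDCG@10:", round(ndcg_at_k(ratings), 3))
    print("p@10 partial:", precision_at_k(ratings, min_rating=1))
    print("p@10 strict:", precision_at_k(ratings, min_rating=2))
```

Under this reading, strict precision can never exceed partial precision, which is consistent with the ordering of the reported values (0.61 versus 0.78).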