A content-based dataset recommendation system for researchers-a case study on Gene Expression Omnibus (GEO) repository.

Affiliations

Department of Biostatistics and Data Science, School of Public Health, The University of Texas Health Science Center at Houston, 1200 Pressler Street, Suite E-833, Houston, TX, 77030, USA.

School of Biomedical Informatics, The University of Texas Health Science Center at Houston, 7000 Fannin St., Suite 600, Houston, TX, 77030, USA.

Publication information

Database (Oxford). 2020 Jan 1;2020:1. doi: 10.1093/database/baaa064.

DOI:10.1093/database/baaa064
PMID:33002137
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC7659921/
Abstract

It is a growing trend among researchers to make their data publicly available for experimental reproducibility and data reusability. Sharing data with fellow researchers also increases the visibility of the work. On the other hand, some researchers are held back by a lack of data resources. To overcome this challenge, many repositories and knowledge bases have been established to ease data sharing, and over the past two decades the number of datasets added to these repositories has increased exponentially. However, most of these repositories are domain-specific, and none of them can recommend datasets to researchers/users. Naturally, it is challenging for a researcher to keep track of all the relevant repositories for potential use. Thus, a dataset recommender system that recommends datasets to a researcher based on previous publications can enhance their productivity and expedite further research.

This work adopts an information retrieval (IR) paradigm for dataset recommendation. We hypothesize that two fundamental differences exist between dataset recommendation and PubMed-style biomedical IR beyond the corpus. First, instead of keywords, the query is the researcher, embodied by his or her publications. Second, to filter the relevant datasets from non-relevant ones, researchers are better represented by a set of interests than by the entire body of their research. This second idea is implemented using a non-parametric clustering technique that groups each researcher's publications into clusters. Datasets are then recommended to each researcher using the cosine similarity between the vector representations of the publication clusters and the datasets. After manual evaluation by five researchers, the proposed method achieved a maximum normalized discounted cumulative gain at 10 (NDCG@10) of 0.89, a partial precision at 10 (p@10) of 0.78 and a strict p@10 of 0.61.

To the best of our knowledge, this is the first study of its kind on content-based dataset recommendation. We hope that this system will further promote data sharing, reduce researchers' workload in identifying the right dataset and increase the reusability of biomedical datasets. Database URL: http://genestudy.org/recommends/#/.
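The ranking step described in the abstract (cosine similarity between publication-cluster vectors and dataset vectors) and the NDCG@10 metric can be illustrated with a minimal sketch. It is not the authors' implementation. The abstract does not name the clustering technique or the text vectorization, so the sketch assumes the publication vectors are already computed and already grouped into clusters. It represents each cluster by its centroid and scores a dataset by its best-matching cluster, and it uses linear gain for NDCG; all of these are assumptions. The function names and the dataset IDs are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def recommend(clusters, datasets, k=10):
    """Rank datasets for one researcher and return the top-k dataset IDs.

    clusters: list of clusters, each a list of publication vectors
              (assumed to come from an earlier clustering step).
    datasets: dict mapping a dataset ID to its vector.
    A dataset's score is its highest cosine similarity to any cluster
    centroid (an assumption; the abstract does not say how scores combine).
    """
    centroids = [centroid(c) for c in clusters]
    scored = [
        (max(cosine(c, vec) for c in centroids), ds_id)
        for ds_id, vec in datasets.items()
    ]
    # Highest score first; ties broken by dataset ID for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [ds_id for _, ds_id in scored[:k]]

def ndcg_at_k(relevances, k=10):
    """NDCG@k for graded relevance labels given in ranked order (linear gain)."""
    def dcg(rels):
        return sum(r / math.log2(i + 2) for i, r in enumerate(rels[:k]))
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

# Toy data: two interest clusters and three hypothetical datasets.
clusters = [[[1.0, 0.0, 0.0], [0.9, 0.1, 0.0]], [[0.0, 0.0, 1.0]]]
datasets = {
    "dataset_a": [1.0, 0.0, 0.0],
    "dataset_b": [0.0, 1.0, 0.0],
    "dataset_c": [0.0, 0.1, 0.9],
}
top = recommend(clusters, datasets, k=2)
print(top)
```

Scoring against each cluster separately, rather than against one average of all publications, reflects the abstract's point that a researcher is better represented by a set of interests: in the toy data, `dataset_c` matches only the second, smaller interest cluster and is still ranked highly.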

