Using semantic search to find publicly available gene-expression datasets.

Authors

Brown Grace S, Wengler James, Fabelico Aaron Joyce S, Muir Abigail, Tubbs Anna, Warren Amanda, Millett Alexandra N, Yu Xinrui Xiang, Pavlidis Paul, Rogic Sanja, Piccolo Stephen R

Affiliations

Department of Biology, Brigham Young University, Provo, Utah, USA.

Institute of Biosciences and Technology, Texas A&M Health Science Center, Houston, TX, USA.

Publication information

bioRxiv. 2025 Mar 15:2025.03.13.643153. doi: 10.1101/2025.03.13.643153.

Abstract

Millions of high-throughput molecular datasets have been shared in public repositories. Researchers can reuse such data to validate their own findings and explore novel questions. A frequent goal is to find multiple datasets that address similar research topics and to either combine them directly or integrate inferences from them. However, a major challenge is finding relevant datasets due to the vast number of candidates, inconsistencies in their descriptions, and a lack of semantic annotations. This challenge is first among the FAIR principles for scientific data. Here we focus on dataset discovery within Gene Expression Omnibus (GEO), a repository containing hundreds of thousands of data series. GEO supports queries based on keywords, ontology terms, and other annotations. However, reviewing these results is time-consuming and tedious, and such searches often miss relevant datasets. We hypothesized that language models could address this problem by summarizing dataset descriptions as numeric representations (embeddings). Assuming a researcher has previously found some relevant datasets, we evaluated the potential to find additional relevant datasets. For six human medical conditions, we used 30 models to generate embeddings for datasets that human curators had previously associated with the conditions and identified other datasets with the most similar descriptions. This approach was often, but not always, more effective than GEO's search engine. Our top-performing models were trained on general corpora, used contrastive-learning strategies, and used relatively large embeddings. Our findings suggest that language models have the potential to improve dataset discovery, perhaps in combination with existing search tools.
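
The retrieval idea described in the abstract (embed the descriptions of datasets already known to be relevant, then look for other datasets with similar embeddings) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration rather than the authors' implementation: `toy_embed` is a word-count stand-in for a real language model, the dataset identifiers and descriptions are invented, and averaging the seed embeddings and ranking by cosine similarity is just one plausible way to score candidates.

```python
import math
from collections import Counter


def toy_embed(text, vocabulary):
    """Hypothetical stand-in for a language model: count vocabulary words."""
    counts = Counter(text.lower().split())
    return [float(counts[word]) for word in vocabulary]


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_candidates(seed_embeddings, candidate_embeddings):
    """Rank candidate IDs by similarity to the mean of the seed embeddings.

    Assumption: the seeds are summarized by their centroid; other
    aggregation strategies are equally possible.
    """
    dim = len(seed_embeddings[0])
    centroid = [
        sum(vec[i] for vec in seed_embeddings) / len(seed_embeddings)
        for i in range(dim)
    ]
    scored = [
        (dataset_id, cosine_similarity(centroid, vec))
        for dataset_id, vec in candidate_embeddings.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Invented example data: two "known relevant" descriptions and two candidates.
vocabulary = ["breast", "cancer", "tumor", "diabetes", "insulin", "expression"]
seeds = ["breast cancer tumor expression", "expression in breast tumor samples"]
candidates = {
    "DATASET_A": "insulin response in diabetes",
    "DATASET_B": "breast cancer expression profiles",
}

seed_vecs = [toy_embed(text, vocabulary) for text in seeds]
cand_vecs = {key: toy_embed(text, vocabulary) for key, text in candidates.items()}
ranking = rank_candidates(seed_vecs, cand_vecs)

for dataset_id, score in ranking:
    print(dataset_id, round(score, 3))
```

In practice the word-count function would be replaced by embeddings from a pretrained language model, and the top-ranked candidates would still need manual review, since the abstract reports that this approach was often, but not always, more effective than GEO's search engine.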

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/35d3/11952526/49df6d6eb8ec/nihpp-2025.03.13.643153v1-f0001.jpg