Joint Representation Learning for Retrieval and Annotation of Genomic Interval Sets.

Authors

Gharavi Erfaneh, LeRoy Nathan J, Zheng Guangtao, Zhang Aidong, Brown Donald E, Sheffield Nathan C

Affiliations

Center for Public Health Genomics, School of Medicine, University of Virginia, Charlottesville, VA 22908, USA.

School of Data Science, University of Virginia, Charlottesville, VA 22904, USA.

Publication information

Bioengineering (Basel). 2024 Mar 8;11(3):263. doi: 10.3390/bioengineering11030263.

DOI: 10.3390/bioengineering11030263
PMID: 38534537
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC10967841/

Abstract

As available genomic interval data increase in scale, we require fast systems to search them. A common approach is simple string matching to compare a search term to metadata, but this is limited by incomplete or inaccurate annotations. An alternative is to compare data directly through genomic region overlap analysis, but this approach leads to challenges like sparsity, high dimensionality, and computational expense. We require novel methods to quickly and flexibly query large, messy genomic interval databases. Here, we develop a genomic interval search system using representation learning. We train numerical embeddings for a collection of region sets simultaneously with their metadata labels, capturing similarity between region sets and their metadata in a low-dimensional space. Using these learned co-embeddings, we develop a system that solves three related information retrieval tasks using embedding distance computations: retrieving region sets related to a user query string, suggesting new labels for database region sets, and retrieving database region sets similar to a query region set. We evaluate these use cases and show that jointly learned representations of region sets and metadata are a promising approach for fast, flexible, and accurate genomic region information retrieval.
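
The three retrieval tasks named in the abstract can be illustrated with a minimal sketch, assuming the co-embeddings have already been trained and that labels and region sets live in the same vector space. Everything below is hypothetical and for illustration only: the 3-dimensional toy vectors, the label and region-set names, the helper names `cosine` and `nearest`, and the choice of cosine similarity as the distance measure are not taken from the paper.

```python
import math

# Hypothetical toy co-embeddings: metadata labels and region sets share one space.
LABELS = {
    "K562": [1.0, 0.0, 0.0],
    "HepG2": [0.0, 1.0, 0.0],
    "H3K27ac": [0.0, 0.0, 1.0],
}
REGION_SETS = {
    "set_A": [0.9, 0.1, 0.0],
    "set_B": [0.1, 0.9, 0.1],
    "set_C": [0.6, 0.0, 0.8],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest(query, table, k=2):
    """Return the k names in `table` whose vectors are most similar to `query`."""
    scored = sorted(
        ((cosine(query, vec), name) for name, vec in table.items()),
        reverse=True,
    )
    return [name for _, name in scored[:k]]


def search_by_label(label, k=2):
    """Task 1: region sets closest to a query metadata label."""
    return nearest(LABELS[label], REGION_SETS, k)


def suggest_labels(set_name, k=2):
    """Task 2: candidate labels closest to a database region set."""
    return nearest(REGION_SETS[set_name], LABELS, k)


def search_by_region_set(query_vec, k=2):
    """Task 3: database region sets closest to an embedded query region set."""
    return nearest(query_vec, REGION_SETS, k)


print(search_by_label("K562"))
print(suggest_labels("set_C"))
print(search_by_region_set([0.0, 1.0, 0.2], k=1))
```

Because all three tasks reduce to a nearest-neighbor lookup in one shared space, a single ranking helper serves each of them; only the query vector and the table being searched change. At database scale the exhaustive scan in `nearest` would presumably be replaced by an approximate nearest-neighbor index.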

Figures

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/6ded/10967841/85931bf48919/bioengineering-11-00263-g001.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/6ded/10967841/2a1fdab8b021/bioengineering-11-00263-g002.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/6ded/10967841/e18bed71d713/bioengineering-11-00263-g003.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/6ded/10967841/72f77a50307e/bioengineering-11-00263-g004.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/6ded/10967841/5d7b7b64a74a/bioengineering-11-00263-g005.jpg
