Gharavi Erfaneh, LeRoy Nathan J, Zheng Guangtao, Zhang Aidong, Brown Donald E, Sheffield Nathan C
Center for Public Health Genomics, School of Medicine, University of Virginia, Charlottesville, VA 22908, USA.
School of Data Science, University of Virginia, Charlottesville, VA 22904, USA.
Bioengineering (Basel). 2024 Mar 8;11(3):263. doi: 10.3390/bioengineering11030263.
As available genomic interval data increase in scale, we require fast systems to search them. A common approach is simple string matching to compare a search term to metadata, but this is limited by incomplete or inaccurate annotations. An alternative is to compare data directly through genomic region overlap analysis, but this approach leads to challenges like sparsity, high dimensionality, and computational expense. We require novel methods to quickly and flexibly query large, messy genomic interval databases. Here, we develop a genomic interval search system using representation learning. We train numerical embeddings for a collection of region sets simultaneously with their metadata labels, capturing similarity between region sets and their metadata in a low-dimensional space. Using these learned co-embeddings, we develop a system that solves three related information retrieval tasks using embedding distance computations: retrieving region sets related to a user query string, suggesting new labels for database region sets, and retrieving database region sets similar to a query region set. We evaluate these use cases and show that jointly learned representations of region sets and metadata are a promising approach for fast, flexible, and accurate genomic region information retrieval.
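The three retrieval tasks named above all reduce to nearest-neighbor lookups in a shared embedding space, which can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the toy 2-dimensional vectors, the identifiers (`rs_a`, `label_x`, and so on), the helper names, and the choice of cosine similarity as the distance measure. It does not reproduce the authors' model or training procedure; it only shows how already-learned co-embeddings could be queried.

```python
import numpy as np

# Hypothetical, hand-written co-embeddings: region sets and metadata labels
# are assumed to live in the same low-dimensional space.
REGION_SET_IDS = ["rs_a", "rs_b", "rs_c"]
REGION_SET_VECS = np.array([[0.9, 0.1], [0.8, 0.3], [0.1, 0.95]])
LABELS = ["label_x", "label_y"]
LABEL_VECS = np.array([[1.0, 0.0], [0.0, 1.0]])


def normalize(x):
    """Scale vectors to unit length along the last axis."""
    x = np.asarray(x, dtype=float)
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def top_k(query_vec, candidate_vecs, names, k):
    """Return the k candidate names with highest cosine similarity to the query."""
    sims = normalize(candidate_vecs) @ normalize(query_vec)
    order = np.argsort(-sims)[:k]
    return [names[i] for i in order]


def search_by_label(label, k=2):
    """Task 1: retrieve region sets related to a query string (a known label)."""
    query = LABEL_VECS[LABELS.index(label)]
    return top_k(query, REGION_SET_VECS, REGION_SET_IDS, k)


def suggest_labels(region_set_id, k=1):
    """Task 2: suggest labels for a region set already in the database."""
    query = REGION_SET_VECS[REGION_SET_IDS.index(region_set_id)]
    return top_k(query, LABEL_VECS, LABELS, k)


def similar_region_sets(query_vec, k=1):
    """Task 3: retrieve database region sets similar to a query region set,
    given the query's embedding (how it is embedded is out of scope here)."""
    return top_k(query_vec, REGION_SET_VECS, REGION_SET_IDS, k)


if __name__ == "__main__":
    print(search_by_label("label_x"))
    print(suggest_labels("rs_c"))
    print(similar_region_sets([0.1, 0.9]))
```

Because each task is a single matrix-vector product followed by a sort, query cost scales with the number of stored embeddings and their dimension rather than with the number of genomic intervals, which is the speed advantage over direct overlap analysis that the abstract points to.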