A best-match approach for gene set analyses in embedding spaces.

Affiliations

Department of Computer Science, Rice University, Houston, Texas 77005, USA.

Publication information

Genome Res. 2024 Oct 11;34(9):1421-1433. doi: 10.1101/gr.279141.124.

Abstract

Embedding methods have emerged as a valuable class of approaches for distilling essential information from complex high-dimensional data into more accessible lower-dimensional spaces. Applications of embedding methods to biological data have demonstrated that gene embeddings can effectively capture physical, structural, and functional relationships between genes. However, this utility has been primarily realized by using gene embeddings for downstream machine-learning tasks. Much less has been done to examine the embeddings directly, especially analyses of gene sets in embedding spaces. Here, we propose an Algorithm for Network Data Embedding and Similarity (ANDES), a novel best-match approach that can be used with existing gene embeddings to compare gene sets while reconciling gene set diversity. This intuitive method has important downstream implications for improving the utility of embedding spaces for various tasks. Specifically, we show how ANDES, when applied to different gene embeddings encoding protein-protein interactions, can be used as a novel overrepresentation- and rank-based gene set enrichment analysis method that achieves state-of-the-art performance. Additionally, ANDES can use multiorganism joint gene embeddings to facilitate functional knowledge transfer across organisms, allowing for phenotype mapping across model systems. Our flexible, straightforward best-match methodology can be extended to other embedding spaces with diverse community structures between set elements.
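The abstract does not spell out how a best-match comparison is scored, so the following is only a minimal sketch of one plausible reading: each gene in one set is paired with its most similar gene in the other set by cosine similarity, the best matches are averaged in each direction, and the two directions are averaged. The function names, the toy two-dimensional embeddings, and the choice of cosine similarity are all assumptions made for illustration, and any statistical normalization the published method may apply is not shown.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors (assumed similarity measure)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

def best_match_similarity(set_a, set_b):
    """Hypothetical bidirectional best-match score between two sets of vectors."""
    # Pairwise similarity matrix: rows are genes in set_a, columns are genes in set_b.
    sims = [[cosine(u, v) for v in set_b] for u in set_a]
    # For each gene in A, keep only its best match in B, then average.
    a_to_b = sum(max(row) for row in sims) / len(set_a)
    # For each gene in B, keep only its best match in A, then average.
    b_to_a = sum(max(col) for col in zip(*sims)) / len(set_b)
    # Average both directions so the score is symmetric.
    return (a_to_b + b_to_a) / 2

# Toy embeddings (made up): g1 and g3 coincide, g2 is orthogonal to both.
toy = {"g1": [1.0, 0.0], "g2": [0.0, 1.0], "g3": [1.0, 0.0]}
score = best_match_similarity([toy["g1"], toy["g2"]], [toy["g3"]])
print(score)
```

Because only the best match per gene contributes, a functionally diverse set is not penalized for members that have no counterpart in the other set as heavily as it would be under an all-pairs average.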
