Methods for evaluating unsupervised vector representations of genomic regions.

Authors

Zheng Guangtao, Rymuza Julia, Gharavi Erfaneh, LeRoy Nathan J, Zhang Aidong, Sheffield Nathan C

Affiliations

Department of Computer Science, School of Engineering, University of Virginia, Charlottesville, VA 22908, USA.

Department of Genome Sciences, School of Medicine, University of Virginia, Charlottesville, VA 22908, USA.

Publication information

NAR Genom Bioinform. 2024 Aug 10;6(3):lqae086. doi: 10.1093/nargab/lqae086. eCollection 2024 Sep.

Abstract

Representation learning models have become a mainstay of modern genomics. These models are trained to yield vector representations, or embeddings, of various biological entities, such as cells, genes, individuals, or genomic regions. Recent applications of unsupervised embedding approaches have been shown to learn relationships among genomic regions that define functional elements in a genome. Unsupervised representation learning of genomic regions is free of the supervision from curated metadata and can condense rich biological knowledge from publicly available data to region embeddings. However, there exists no method for evaluating the quality of these embeddings in the absence of metadata, making it difficult to assess the reliability of analyses based on the embeddings, and to tune model training to yield optimal results. To bridge this gap, we propose four evaluation metrics: the cluster tendency score (CTS), the reconstruction score (RCS), the genome distance scaling score (GDSS), and the neighborhood preserving score (NPS). The CTS and RCS statistically quantify how well region embeddings can be clustered and how well the embeddings preserve information in training data. The GDSS and NPS exploit the biological tendency of regions close in genomic space to have similar biological functions; they measure how much such information is captured by individual region embeddings in a set. We demonstrate the utility of these statistical and biological scores for evaluating unsupervised genomic region embeddings and provide guidelines for learning reliable embeddings.
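The idea behind the neighborhood-based scores, that regions close in genomic space should also be close in embedding space, can be illustrated with a small sketch. The code below is not the authors' NPS definition, which the abstract does not spell out; it is a simplified stand-in under stated assumptions: genomic distance is the absolute difference of region midpoints on the same chromosome, embedding distance is Euclidean, and every chromosome holds more than k regions. The function name `neighborhood_overlap` and its parameters are hypothetical.

```python
import numpy as np


def neighborhood_overlap(chroms, midpoints, embeddings, k=2):
    """Mean fraction of each region's k nearest genomic neighbors that are
    also among its k nearest embedding neighbors (illustrative only)."""
    chroms = np.asarray(chroms)
    midpoints = np.asarray(midpoints, dtype=float)
    embeddings = np.asarray(embeddings, dtype=float)
    n = len(midpoints)

    # Genomic distance: midpoint difference, infinite across chromosomes.
    gdist = np.abs(midpoints[:, None] - midpoints[None, :])
    gdist[chroms[:, None] != chroms[None, :]] = np.inf

    # Embedding distance: pairwise Euclidean distance.
    diff = embeddings[:, None, :] - embeddings[None, :, :]
    edist = np.sqrt((diff ** 2).sum(axis=-1))

    # A region is never its own neighbor.
    np.fill_diagonal(gdist, np.inf)
    np.fill_diagonal(edist, np.inf)

    # Indices of the k nearest neighbors in each space.
    g_nn = np.argsort(gdist, axis=1, kind="stable")[:, :k]
    e_nn = np.argsort(edist, axis=1, kind="stable")[:, :k]

    overlaps = [len(set(g_nn[i]) & set(e_nn[i])) / k for i in range(n)]
    return float(np.mean(overlaps))


if __name__ == "__main__":
    # Toy data: six regions on one chromosome with random 4-d embeddings.
    rng = np.random.default_rng(0)
    chroms = ["chr1"] * 6
    mids = [100, 200, 300, 400, 5000, 5100]
    emb = rng.normal(size=(6, 4))
    print(neighborhood_overlap(chroms, mids, emb, k=2))
```

A score near 1 means genomic neighbors are largely preserved in the embedding space, while embeddings unrelated to genomic position tend to score low. A real evaluation would need a baseline (for example, the same score on shuffled embeddings) to judge whether an observed value is meaningful.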

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/55ca/11316252/fa0061d33cad/lqae086fig1.jpg
