Methods for evaluating unsupervised vector representations of genomic regions.

Author information

Zheng Guangtao, Rymuza Julia, Gharavi Erfaneh, LeRoy Nathan J, Zhang Aidong, Sheffield Nathan C

Affiliations

Department of Computer Science, School of Engineering, University of Virginia, Charlottesville, VA 22908, USA.

Department of Genome Sciences, School of Medicine, University of Virginia, Charlottesville, VA 22908, USA.

Publication information

NAR Genom Bioinform. 2024 Aug 10;6(3):lqae086. doi: 10.1093/nargab/lqae086. eCollection 2024 Sep.

DOI:10.1093/nargab/lqae086

PMID:39131817

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC11316252/

Abstract

Representation learning models have become a mainstay of modern genomics. These models are trained to yield vector representations, or embeddings, of various biological entities, such as cells, genes, individuals, or genomic regions. Recent applications of unsupervised embedding approaches have been shown to learn relationships among genomic regions that define functional elements in a genome. Unsupervised representation learning of genomic regions is free of the supervision from curated metadata and can condense rich biological knowledge from publicly available data to region embeddings. However, there exists no method for evaluating the quality of these embeddings in the absence of metadata, making it difficult to assess the reliability of analyses based on the embeddings, and to tune model training to yield optimal results. To bridge this gap, we propose four evaluation metrics: the cluster tendency score (CTS), the reconstruction score (RCS), the genome distance scaling score (GDSS), and the neighborhood preserving score (NPS). The CTS and RCS statistically quantify how well region embeddings can be clustered and how well the embeddings preserve information in training data. The GDSS and NPS exploit the biological tendency of regions close in genomic space to have similar biological functions; they measure how much such information is captured by individual region embeddings in a set. We demonstrate the utility of these statistical and biological scores for evaluating unsupervised genomic region embeddings and provide guidelines for learning reliable embeddings.
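The intuition behind the neighborhood-based scores can be illustrated with a small sketch. The abstract does not give the exact definition of the NPS, so the code below is only an assumed simplification: for each region on a single chromosome, it takes the K nearest regions by genomic midpoint and the K nearest regions by Euclidean distance between embeddings, then reports the average fraction of neighbors the two sets share. The function names, the midpoint-based genomic distance, and the parameter `k` are all hypothetical choices made for illustration.

```python
import numpy as np


def knn_indices(dist, k):
    """Indices of the k nearest neighbors per row, excluding the row itself."""
    d = dist.copy()
    np.fill_diagonal(d, np.inf)  # a region is never its own neighbor
    return np.argsort(d, axis=1)[:, :k]


def neighborhood_overlap(midpoints, embeddings, k=5):
    """Mean fraction of shared neighbors between genome and embedding space.

    Assumption: all regions lie on one chromosome and are summarized by
    their midpoints; this is an illustrative stand-in, not the paper's NPS.
    """
    midpoints = np.asarray(midpoints, dtype=float)
    embeddings = np.asarray(embeddings, dtype=float)

    # Pairwise genomic distance between region midpoints.
    genome_dist = np.abs(midpoints[:, None] - midpoints[None, :])

    # Pairwise Euclidean distance between region embeddings.
    diff = embeddings[:, None, :] - embeddings[None, :, :]
    embed_dist = np.sqrt((diff ** 2).sum(axis=-1))

    genome_nn = knn_indices(genome_dist, k)
    embed_nn = knn_indices(embed_dist, k)

    overlaps = [len(set(g) & set(e)) / k for g, e in zip(genome_nn, embed_nn)]
    return float(np.mean(overlaps))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    mids = np.arange(0, 5000, 100)  # toy region midpoints

    # Toy embeddings that vary smoothly with position, plus small noise.
    structured = np.stack([np.sin(mids / 2000.0), np.cos(mids / 2000.0)], axis=1)
    structured += rng.normal(scale=0.01, size=structured.shape)

    # Toy embeddings with no relationship to genomic position.
    random_emb = rng.normal(size=(len(mids), 2))

    print("structured:", neighborhood_overlap(mids, structured, k=5))
    print("random:    ", neighborhood_overlap(mids, random_emb, k=5))
```

Under these assumptions, a useful reading of the score is relative rather than absolute: comparing the overlap for trained embeddings against the overlap for random embeddings of the same shape indicates how much genomic neighborhood structure the embeddings retain beyond chance.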

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/55ca/11316252/fa0061d33cad/lqae086fig1.jpg

Similar articles

16S rRNA sequence embeddings: Meaningful numeric feature representations of nucleotide sequences that are convenient for downstream analyses.

PLoS Comput Biol. 2019 Feb 26;15(2):e1006721. doi: 10.1371/journal.pcbi.1006721. eCollection 2019 Feb.

Embeddings of genomic region sets capture rich biological associations in lower dimensions.

Bioinformatics. 2021 Dec 7;37(23):4299-4306. doi: 10.1093/bioinformatics/btab439.

Joint Representation Learning for Retrieval and Annotation of Genomic Interval Sets.

Bioengineering (Basel). 2024 Mar 8;11(3):263. doi: 10.3390/bioengineering11030263.

Using molecular embeddings in QSAR modeling: does it make a difference?

Brief Bioinform. 2022 Jan 17;23(1). doi: 10.1093/bib/bbab365.

Unsupervised online multitask learning of behavioral sentence embeddings.

PeerJ Comput Sci. 2019 Jun 10;5:e200. doi: 10.7717/peerj-cs.200. eCollection 2019.

A comparison of word embeddings for the biomedical natural language processing.

J Biomed Inform. 2018 Nov;87:12-20. doi: 10.1016/j.jbi.2018.09.008. Epub 2018 Sep 12.

BioConceptVec: Creating and evaluating literature-based biomedical concept embeddings on a large scale.

PLoS Comput Biol. 2020 Apr 23;16(4):e1007617. doi: 10.1371/journal.pcbi.1007617. eCollection 2020 Apr.

Learning Canonical Embeddings for Unsupervised Shape Correspondence With Locally Linear Transformations.

IEEE Trans Pattern Anal Mach Intell. 2023 Dec;45(12):14872-14887. doi: 10.1109/TPAMI.2023.3307592. Epub 2023 Nov 3.

Learned protein embeddings for machine learning.

Bioinformatics. 2018 Aug 1;34(15):2642-2648. doi: 10.1093/bioinformatics/bty178.

Cited by

Methods for constructing and evaluating consensus genomic interval sets.

Nucleic Acids Res. 2024 Sep 23;52(17):10119-10131. doi: 10.1093/nar/gkae685.

Joint Representation Learning for Retrieval and Annotation of Genomic Interval Sets.

Bioengineering (Basel). 2024 Mar 8;11(3):263. doi: 10.3390/bioengineering11030263.
