Perplexity as a Metric for Isoform Diversity in the Human Transcriptome.

Authors

Schertzer Megan D, Park Stella H, Su Jiayu, Sheynkman Gloria M, Knowles David A

Affiliations

New York Genome Center, New York, NY.

Department of Computer Science, Columbia University, New York, NY.

Publication information

bioRxiv. 2025 Jul 2:2025.07.02.662769. doi: 10.1101/2025.07.02.662769.

Abstract

Long-read sequencing (LRS) has revealed a far greater diversity of RNA isoforms than earlier technologies, heightening the need to determine which, and how many, isoforms per gene are biologically meaningful. To define the space of relevant isoforms from LRS, many existing analysis pipelines rely on arbitrary expression cutoffs, but a single threshold cannot accommodate the broad variability in isoform complexity across genes, cell types, and disease states captured by LRS. To address this, we propose using perplexity, an interpretable measure derived from entropy, which quantifies the effective number of isoforms per gene based on the full, unfiltered isoform ratio distribution. Calculating perplexity for 124 ENCODE4 PacBio LRS datasets spanning 55 human cell types, we show that it provides intuitive assessments of isoform diversity and captures uncertainty across genes with varying complexity. Perplexity can be calculated at multiple gene regulatory levels, from transcript to protein, to compare how isoform diversity is reduced across stages of gene expression. On average, genes have an ORF-level perplexity of 2.1, indicating the production of roughly two distinct protein isoforms. We extended this analysis to evaluate expression variation across tissues and identified 4,593 ORFs across 3,102 genes with moderate to extreme tissue specificity. We propose perplexity as a consistent, quantitative metric for interpreting isoform diversity across genes, cell types, and disease states. All results are compiled into a community resource to enable cross-study comparisons of novel isoforms.
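As an illustration of the idea, the sketch below computes perplexity for a single gene, assuming the standard definition of perplexity as the exponential of the Shannon entropy of the isoform ratio distribution. The abstract does not give the authors' exact implementation, so the function names, the read counts, and the transcript-to-ORF mapping here are all hypothetical. The second function collapses transcript counts onto shared ORFs to show how diversity can shrink between the transcript and protein levels.

```python
import math
from collections import defaultdict


def perplexity(counts):
    """Effective number of isoforms: exp(Shannon entropy) of the isoform ratios.

    Assumes the standard definition; not taken from the authors' code.
    """
    total = sum(counts)
    if total <= 0:
        raise ValueError("need at least one read for the gene")
    # Isoforms with zero reads contribute nothing to the entropy.
    ratios = [c / total for c in counts if c > 0]
    entropy = -sum(p * math.log(p) for p in ratios)
    return math.exp(entropy)


def collapse(counts_by_transcript, transcript_to_orf):
    """Sum transcript-level counts over transcripts that encode the same ORF."""
    orf_counts = defaultdict(float)
    for tx, count in counts_by_transcript.items():
        orf_counts[transcript_to_orf[tx]] += count
    return dict(orf_counts)


# Hypothetical gene: three transcripts, two of which encode the same ORF.
tx_counts = {"tx1": 40, "tx2": 40, "tx3": 20}
tx_to_orf = {"tx1": "orfA", "tx2": "orfA", "tx3": "orfB"}

tx_ppl = perplexity(tx_counts.values())
orf_ppl = perplexity(collapse(tx_counts, tx_to_orf).values())
print(f"transcript-level perplexity: {tx_ppl:.2f}")
print(f"ORF-level perplexity:        {orf_ppl:.2f}")
```

Under this definition, a gene with two equally expressed isoforms has a perplexity of exactly 2, while a gene dominated by one isoform with a few rare ones scores close to 1, without any expression cutoff being applied.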

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/7781/12236620/4e7b59b17667/nihpp-2025.07.02.662769v1-f0001.jpg
