Chow Savio Ho-Chit, Shi Christina Huan, Deshpande Aniruddha, Cao Qin, Yip Kevin Y
Center for Data Sciences, Sanford Burnham Prebys Medical Discovery Institute, La Jolla, CA 92037, USA.
Cancer Genome and Epigenetics Program, NCI-Designated Cancer Center, Sanford Burnham Prebys Medical Discovery Institute, La Jolla, CA 92037, USA.
bioRxiv. 2025 Jul 25:2025.07.21.665977. doi: 10.1101/2025.07.21.665977.
A holy grail in computational biology is accurate modeling of transcript expression levels using epigenetic features, which would provide a quantitative way to study gene regulation in normal and disease states. Previous studies relied heavily on immortalized cell lines that exhibit properties different from cells in natural tissue environments. Most studies also quantified the expression of each gene by a single expression level, which fails to capture separate expression levels of different transcript isoforms of the same gene. In this study, making use of the latest large-scale dataset of paired transcriptomic and epigenomic data of human samples produced by the International Human Epigenome Consortium (IHEC), we computationally modeled the expression levels of individual transcript isoforms in 324 samples from 29 tissue types. We constructed the models using graph-based methods that integrate both location-specific epigenomic features and multiple types of gene-gene relationships. We found that to infer transcript isoform expression levels in a sample, a model that integrates information from many samples of other tissue types consistently outperforms a model trained on data from this sample itself, providing strong support that it is possible to construct a "universal" model that can accurately infer transcript isoform expression levels across tissue types.