Sequence-Only Prediction of Super-Enhancers in Human Cell Lines Using Transformer Models.

Authors

Kravchuk Ekaterina V, Ashniev German A, Gladkova Marina G, Orlov Alexey V, Zaitseva Zoia G, Malkerov Juri A, Orlova Natalia N

Affiliations

Prokhorov General Physics Institute of the Russian Academy of Sciences, 38 Vavilov St., 119991 Moscow, Russia.

Faculty of Biology, Lomonosov Moscow State University, Leninskiye Gory, MSU, 1-12, 119991 Moscow, Russia.

Publication information

Biology (Basel). 2025 Feb 7;14(2):172. doi: 10.3390/biology14020172.

DOI:10.3390/biology14020172

PMID:40001940

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11852244/

Abstract

This study applies transformer-based deep learning models to super-enhancer (SE) prediction in human tumor cell lines, focusing on sequence-only features of super-enhancer and enhancer elements in the human genome. The proposed SE-prediction method applies GENA-LM, which handles long DNA sequences, to a classification task: distinguishing super-enhancers from enhancers, using H3K36me, H3K4me1, H3K4me3 and H3K27ac landscape datasets from the HeLa, HEK293, H2171, Jurkat, K562, MM1S and U87 cell lines. The model was fine-tuned on relevant sequence data, allowing extended genomic sequences to be analyzed without the need for the epigenetic markers used in earlier approaches. The study reports balanced accuracy that surpasses previous models such as SENet, particularly in the HEK293 and K562 cell lines. It was also shown that super-enhancers frequently co-localize with epigenetic marks such as H3K4me3 and H3K27ac. The attention mechanism of the model provided insights into the sequence features contributing to SE classification, indicating a correlation between sequence-only features and the aforementioned epigenetic landscapes. These findings support the potential use of transformer models in further genomic sequence analysis for bioinformatics applications in enhancer/super-enhancer characterization and gene regulation studies.
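The abstract reports balanced accuracy but does not define it. As a minimal sketch, assuming the standard definition (the mean of sensitivity and specificity), the metric for a binary SE-versus-enhancer classifier could be computed as below. The function name and the toy labels are illustrative assumptions, not data or code from the study.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall for binary labels.

    Assumed encoding (hypothetical): 1 = super-enhancer, 0 = typical enhancer.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must be present in y_true")
    sensitivity = tp / (tp + fn)  # recall on the super-enhancer class
    specificity = tn / (tn + fp)  # recall on the typical-enhancer class
    return (sensitivity + specificity) / 2


# Made-up, imbalanced toy labels: 2 super-enhancers among 10 regions.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 1, 0]

score = balanced_accuracy(y_true, y_pred)
plain = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"balanced accuracy = {score:.4f}, plain accuracy = {plain:.4f}")
```

When one class is much rarer than the other, plain accuracy can look high even if the rare class is poorly recovered; averaging the two per-class recalls weights both classes equally, which is a common reason to prefer this metric for imbalanced classification.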
