Enzo Ferrari Engineering Department, University of Modena and Reggio Emilia, Via P. Vivarelli, 10, Modena, Emilia Romagna 41125, Italy.
Department of Control and Computer Engineering, Corso Duca degli Abruzzi, 24, Turin, Piedmont 10129, Italy.
Comput Methods Programs Biomed. 2022 Oct;225:107035. doi: 10.1016/j.cmpb.2022.107035. Epub 2022 Aug 7.
In recent years, the prediction of gene expression levels has become crucial due to its potential clinical applications. In this context, Xpresso and other methods based on Convolutional Neural Networks and Transformers were first proposed for this purpose. However, all these methods embed the data with standard one-hot encoding, which produces highly sparse matrices. In addition, post-transcriptional regulation processes, which are of utmost importance in gene expression, are not considered in these models.
This paper presents Transformer DeepLncLoc, a novel method to predict mRNA abundance (i.e., gene expression levels) by processing gene promoter sequences, treating the problem as a regression task. The model uses a transformer-based architecture and introduces the DeepLncLoc method to perform the data embedding. Since DeepLncLoc is based on the word2vec algorithm, it avoids the sparse-matrix problem.
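The contrast between the two embeddings can be illustrated with a minimal sketch. It is not the authors' pipeline: the k-mer size, the embedding dimension, the example sequence, and the helper names (`one_hot`, `kmers`, `build_embedding_table`, `embed`, `fraction_zero`) are all assumptions made for illustration, and a randomly initialised table stands in for vectors that a word2vec model would learn from a corpus of sequences.

```python
import random
from itertools import product

ALPHABET = "ACGT"

def one_hot(seq):
    """One row per nucleotide, with a single 1 in a row of width 4."""
    return [[1 if base == letter else 0 for letter in ALPHABET] for base in seq]

def kmers(seq, k=3):
    """Overlapping k-mers, i.e. the 'words' a word2vec-style model sees."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def build_embedding_table(k=3, dim=8, seed=0):
    """Hypothetical stand-in for trained word2vec vectors (random values)."""
    rng = random.Random(seed)
    return {
        "".join(word): [rng.uniform(-1.0, 1.0) for _ in range(dim)]
        for word in product(ALPHABET, repeat=k)
    }

def embed(seq, table, k=3):
    """Replace each k-mer with its dense vector from the lookup table."""
    return [table[word] for word in kmers(seq, k)]

def fraction_zero(matrix):
    """Share of entries that are exactly zero."""
    values = [v for row in matrix for v in row]
    return sum(1 for v in values if v == 0) / len(values)

seq = "ACGTTGCAAC"  # toy sequence, not real promoter data
sparse = one_hot(seq)
dense = embed(seq, build_embedding_table())

print(len(sparse), "x", len(sparse[0]), "zeros:", fraction_zero(sparse))
print(len(dense), "x", len(dense[0]), "zeros:", fraction_zero(dense))
```

In the one-hot matrix three of every four entries are zero by construction, whereas every entry of the dense k-mer representation carries a value.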
Post-transcriptional information related to mRNA stability and transcription factors is included in the model, leading to significantly improved performance over state-of-the-art methods. Transformer DeepLncLoc reached 0.76 on the R evaluation metric, compared to 0.74 for Xpresso.
The Multi-Headed Attention mechanism that characterizes the transformer methodology is well suited to modeling the interactions between locations in the DNA sequence, overcoming the limitations of recurrent models. Finally, integrating the transcription factor data into the pipeline leads to substantial gains in predictive power.
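To make the attention argument concrete, the following is a minimal sketch of generic multi-head self-attention in plain Python, not the architecture used in the paper. The sequence length, model width, number of heads, and random weights are illustrative assumptions, and positional encoding, masking, and layer normalization are omitted. The point it shows is that every position receives a weight for every other position in a single step, whereas a recurrent model has to pass information along the sequence one position at a time.

```python
import math
import random

def matmul(a, b):
    """Matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention_head(x, wq, wk, wv):
    """Scaled dot-product self-attention for a single head."""
    q, k, v = matmul(x, wq), matmul(x, wk), matmul(x, wv)
    scale = math.sqrt(len(q[0]))
    k_t = [list(col) for col in zip(*k)]  # transpose of k
    scores = matmul(q, k_t)               # one score per pair of positions
    weights = [softmax([s / scale for s in row]) for row in scores]
    return matmul(weights, v), weights

def multi_head_attention(x, heads, wo):
    """Run each head, concatenate the outputs per position, then project."""
    outputs, all_weights = [], []
    for wq, wk, wv in heads:
        out, weights = attention_head(x, wq, wk, wv)
        outputs.append(out)
        all_weights.append(weights)
    concat = [sum((out[i] for out in outputs), []) for i in range(len(x))]
    return matmul(concat, wo), all_weights

def random_matrix(rows, cols, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

# Illustrative sizes only: 6 positions, width 8, 2 heads of width 4.
rng = random.Random(0)
seq_len, d_model, n_heads = 6, 8, 2
d_head = d_model // n_heads

x = random_matrix(seq_len, d_model, rng)  # stand-in for embedded positions
heads = [
    tuple(random_matrix(d_model, d_head, rng) for _ in range(3))
    for _ in range(n_heads)
]
wo = random_matrix(d_model, d_model, rng)

output, weights = multi_head_attention(x, heads, wo)
print(len(output), "x", len(output[0]))
print("head 0, position 0 attends with weights:", weights[0][0])
```

Each row of a head's weight matrix sums to 1 and spans all positions, so distant positions interact directly regardless of how far apart they are in the sequence.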