Department of Computer Science and Information Technology, Institute of Technical Education and Research, Siksha 'O' Anusandhan (Deemed to be University), Bhubaneswar 751030, India.
Institute for System Analysis and Computer Science "Antonio Ruberti", National Research Council of Italy, 00185 Rome, Italy.
Int J Mol Sci. 2024 May 3;25(9):4990. doi: 10.3390/ijms25094990.
Predicting transcription factor binding sites (TFBS) is important for understanding how transcription factors regulate gene expression and how this regulation can be modulated for therapeutic purposes. A considerable number of studies address this problem with different approaches, Machine Learning being among the most successful. Nevertheless, we note that many such approaches fail to propose a robust and meaningful method for embedding the genetic data under analysis. We try to overcome this problem by proposing a bidirectional transformer-based encoder, augmented with bidirectional long short-term memory layers and with a capsule layer responsible for the final prediction. To evaluate the efficiency of the proposed approach, we use benchmark ChIP-seq datasets of five cell lines available in the ENCODE repository (A549, GM12878, Hep-G2, H1-hESC, and HeLa). The results show that the proposed method predicts TFBS well within each of the five cell lines; moreover, cross-cell predictions also give satisfactory results. The cross-cell-line experiments are reinforced by the analysis of five additional cell lines, used only to test the model trained on the others. The results confirm that prediction performance across cell lines remains very high, allowing an extensive cross-transcription factor analysis from which several indications of interest for molecular biology may be drawn.
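As a rough illustration of the two ends of such a pipeline, the sketch below shows one common way to turn a DNA sequence into tokens for a transformer-style encoder, and the standard "squash" nonlinearity used in capsule layers, where the length of the output vector is read as a class probability. Both parts are assumptions made for illustration: the abstract does not state how sequences are tokenized or which capsule formulation is used, and the names `kmer_tokens`, `squash`, and `capsule_length` are hypothetical.

```python
import math


def kmer_tokens(seq, k=3):
    """Split a DNA sequence into overlapping k-mers with stride 1.

    Assumption: k-mer tokenization is a common choice for transformer
    encoders on DNA; the abstract does not specify the scheme used.
    """
    seq = seq.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def squash(vec, eps=1e-9):
    """Capsule 'squash' nonlinearity.

    Keeps the direction of vec and maps its length into [0, 1):
    v = (|s|^2 / (1 + |s|^2)) * s / |s|
    """
    norm_sq = sum(x * x for x in vec)
    norm = math.sqrt(norm_sq)
    scale = norm_sq / (1.0 + norm_sq) / (norm + eps)  # eps avoids division by zero
    return [scale * x for x in vec]


def capsule_length(vec):
    """Euclidean length of a capsule vector, read as a confidence score."""
    return math.sqrt(sum(x * x for x in vec))


tokens = kmer_tokens("ACGTAC", k=3)   # ['ACG', 'CGT', 'GTA', 'TAC']
bound_capsule = squash([3.0, 4.0])    # hypothetical raw capsule output
score = capsule_length(bound_capsule) # 25/26, i.e. roughly 0.96
```

In a binary bound/unbound setting, one capsule per class would be squashed in this way and the class with the longer vector taken as the prediction; the encoder and recurrent layers that would sit between these two steps are omitted here.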