Zheng Leqiong, Liu Li, Zhu Wen, Ding Yijie, Wu Fangxiang
School of Mathematics and Statistics, Hainan Normal University, Haikou, China.
Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, China.
Front Genet. 2023 Apr 18;14:1133775. doi: 10.3389/fgene.2023.1133775. eCollection 2023.
Physical interactions between enhancers and promoters are often involved in gene transcriptional regulation. Highly tissue-specific enhancer-promoter interactions (EPIs) are responsible for the differential expression of genes. Experimental methods for measuring EPIs are time-consuming and labor-intensive. Machine learning, an alternative approach, has been widely used to predict EPIs. However, most existing machine learning methods require a large number of functional genomic and epigenomic features as input, which limits their application to different cell lines. In this paper, we developed a random forest model, HARD (H3K27ac, ATAC-seq, RAD21, and Distance), to predict EPIs using only four types of features. Independent tests on a benchmark dataset showed that HARD outperforms other models while using the fewest features. Our results revealed that chromatin accessibility and cohesin binding are important for cell-line-specific EPIs. Furthermore, we trained the HARD model on the GM12878 cell line and tested it on the HeLa cell line. The cross-cell-line prediction also performed well, suggesting that the model has the potential to be applied to other cell lines.
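The workflow described above (a random forest over four feature types, trained on one cell line and evaluated on another) can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes scikit-learn's RandomForestClassifier, uses purely synthetic placeholder data in place of real GM12878 and HeLa measurements, and the helper make_synthetic_pairs, the labelling rule, and the hyperparameters are all hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

# The four feature types named in the abstract, one column each.
FEATURES = ["H3K27ac", "ATAC-seq", "RAD21", "distance"]


def make_synthetic_pairs(n_pairs, seed):
    """Generate placeholder enhancer-promoter pairs (hypothetical, NOT real data)."""
    rng = np.random.default_rng(seed)
    X = rng.random((n_pairs, len(FEATURES)))
    # Arbitrary toy rule (an assumption for illustration only):
    # stronger signals and shorter distance make an interaction more likely.
    score = X[:, 0] + X[:, 1] + X[:, 2] - X[:, 3]
    y = (score + rng.normal(0.0, 0.3, n_pairs) > 1.0).astype(int)
    return X, y


# Stand-ins for the training cell line (GM12878) and the test cell line (HeLa).
X_train, y_train = make_synthetic_pairs(2000, seed=0)
X_test, y_test = make_synthetic_pairs(500, seed=1)

# Hyperparameters are illustrative; the abstract does not state them.
model = RandomForestClassifier(n_estimators=200, random_state=0)
model.fit(X_train, y_train)

# Cross-cell-line evaluation: score the held-out cell line only.
proba = model.predict_proba(X_test)[:, 1]
auc = roc_auc_score(y_test, proba)

# Per-feature importances, the kind of signal one might inspect to see
# which feature types contribute most to the predictions.
importances = dict(zip(FEATURES, model.feature_importances_))
print(f"AUROC on held-out cell line: {auc:.3f}")
print(importances)
```

With real data, each row would correspond to one candidate enhancer-promoter pair, and the training and test matrices would come from different cell lines, so that the reported score reflects transfer rather than within-cell-line fit.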