He Wenying, Xu Jialu, Zuo Yun, Bai Yude, Guo Fei
School of Artificial Intelligence, Hebei University of Technology, No. 5340, Xiping Road, Beichen District, Tianjin 300400, China.
Hebei Province Key Laboratory of Big Data Calculation, Hebei University of Technology, No. 5340, Xiping Road, Beichen District, Tianjin 300130, China.
Brief Funct Genomics. 2025 Jan 15;24. doi: 10.1093/bfgp/elaf003.
Super-enhancers (SEs) are typically located in the regulatory regions of genes, where they drive high-level gene expression. Identifying SEs is crucial for a deeper understanding of gene regulatory networks, disease mechanisms, and the developmental and physiological processes of organisms, and therefore has a profound impact on research and applications in the life sciences. Traditional experimental methods for identifying SEs are costly and time-consuming. Existing methods that predict SEs from sequence data alone use deep learning for feature representation and have achieved good results. However, they overlook biological features related to physicochemical properties, which leads to low interpretability. In addition, their complex model structures often require extensive labeled data for training, which limits their further application to biological data. In this paper, we integrate the strengths of different models and propose EnsembleSE, an ensemble model based on an integration strategy, to enhance generalization ability. We design a multi-angle feature representation method that combines local structure and global information to extract high-dimensional abstract relationships and key low-dimensional biological features from sequences. This improves the effectiveness and interpretability of the model's input features and provides technical support for discovering cell-specific and species-specific patterns of SEs. We evaluated performance on both mouse and human datasets using five metrics, including area under the receiver operating characteristic curve, accuracy, and others. Compared to the latest models, EnsembleSE achieved an average improvement of 4.5% in F1 score and an average improvement of 8.05% in recall, demonstrating the robustness and adaptability of the model on a unified test set. Source code is available at https://github.com/2103374200/EnsembleSE-main.
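The abstract does not state which integration strategy or base models are used, so the following is only a minimal sketch of one common way to combine classifiers and score them with the two metrics reported above. It assumes soft voting (averaging each base model's positive-class probability) and a 0.5 decision threshold; the function names `soft_vote` and `recall_f1`, the probabilities, and the labels are all hypothetical and are not taken from the EnsembleSE repository.

```python
from typing import List, Sequence, Tuple


def soft_vote(prob_lists: Sequence[Sequence[float]]) -> List[float]:
    """Average per-sequence positive-class probabilities across base models.

    Soft voting is an assumed stand-in for the paper's integration strategy.
    """
    n_models = len(prob_lists)
    return [sum(col) / n_models for col in zip(*prob_lists)]


def recall_f1(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[float, float]:
    """Compute recall and F1 for binary labels (1 = SE, 0 = non-SE)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return recall, f1


# Hypothetical outputs of three base models for four candidate sequences.
model_probs = [
    [0.9, 0.2, 0.6, 0.4],
    [0.8, 0.3, 0.4, 0.7],
    [0.7, 0.1, 0.8, 0.1],
]
y_true = [1, 0, 1, 1]  # hypothetical ground-truth labels

avg_probs = soft_vote(model_probs)
y_pred = [1 if p >= 0.5 else 0 for p in avg_probs]  # assumed threshold
recall, f1 = recall_f1(y_true, y_pred)
print(f"recall={recall:.3f}, F1={f1:.3f}")
```

In this toy example the ensemble misses the last true SE because only one of the three base models scores it above 0.5, which illustrates why recall is worth reporting alongside F1 when comparing SE predictors.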