School of Computational Technologies, RMIT University, Melbourne VIC 3000, Australia; Department of Infectious Diseases, Alfred Hospital, Prahran VIC 3008, Australia.
Faculty of Information Technology, Monash University, Clayton VIC 3800, Australia.
Comput Biol Med. 2024 Mar;171:108068. doi: 10.1016/j.compbiomed.2024.108068. Epub 2024 Feb 10.
The availability of large-scale epigenomic data from various cell types and conditions has yielded valuable insights for evaluating and learning the features that predict co-binding of transcription factors (TFs). However, prior models for predicting motif co-occurrence did not scale to global analysis of arbitrary motif combinations or to cross-species prediction. Moreover, mapping co-regulatory modules (CRMs) to gene regulatory networks (GRNs) is crucial for understanding their underlying function. Currently, no comprehensive pipeline exists for large-scale, rapid, and accurate identification of CRMs and GRNs. In this study, we analyzed and evaluated the TF binding characteristics that facilitate biologically significant co-binding in order to identify all potential clusters of co-binding TFs. We curated the UniBind database, which contains ChIP-Seq data from over 1983 samples and 232 TFs, and implemented two machine learning models, a Convolutional Neural Network (CNN) and a Random Forest Classifier (RFC), to predict co-binding between TFs and, from it, CRMs and the potential regulatory networks they operate on. The two models were compared using precision-recall and Receiver Operating Characteristic (ROC) curves. The CNN outperformed the RFC (AUC 0.94 vs. 0.88) and achieved a higher F1 score (0.938 vs. 0.872). The CRMs generated by the clustering algorithm were validated against ChipAtlas and MCOT, revealing additional motifs forming CRMs. We predicted 200k CRMs for 50k+ human genes and validated them against recent CRM prediction methods, with 100% overlap. We then narrowed our focus to heart-related regulatory motifs, filtering the generated CRMs to report 1784 cardiac CRMs containing at least four cardiac TFs. The identified cardiac CRMs revealed potential novel regulators such as ARID3A and RXRB for SCAD, as well as known TFs such as PPARG for F11R. Our findings highlight the importance of the NKX family of transcription factors in cardiac development and provide potential targets for further investigation in cardiac disease.
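The cardiac filtering step described above (keeping only CRMs that contain at least four cardiac TFs) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `filter_cardiac_crms`, the representation of a CRM as a set of TF names, the contents of `CARDIAC_TFS`, and the toy CRMs are all assumptions made for illustration.

```python
from typing import Dict, Set

# Illustrative cardiac TF set (an assumption, not the list used in the study).
CARDIAC_TFS: Set[str] = {"GATA4", "NKX2-5", "TBX5", "MEF2C", "HAND2", "TBX20"}


def filter_cardiac_crms(
    crms: Dict[str, Set[str]],
    cardiac_tfs: Set[str] = CARDIAC_TFS,
    min_cardiac: int = 4,
) -> Dict[str, Set[str]]:
    """Keep CRMs whose TF members include at least `min_cardiac` cardiac TFs."""
    return {
        crm_id: tfs
        for crm_id, tfs in crms.items()
        if len(tfs & cardiac_tfs) >= min_cardiac
    }


# Toy input: hypothetical CRM identifier -> set of co-binding TFs.
example_crms = {
    "crm_1": {"GATA4", "NKX2-5", "TBX5", "MEF2C", "CTCF"},  # 4 cardiac TFs
    "crm_2": {"GATA4", "NKX2-5", "CTCF", "SP1"},            # 2 cardiac TFs
    "crm_3": {"HAND2", "TBX20", "TBX5", "GATA4", "MEF2C"},  # 5 cardiac TFs
}

cardiac_crms = filter_cardiac_crms(example_crms)
print(sorted(cardiac_crms))
```

Using set intersection keeps the threshold check independent of how many non-cardiac TFs a module contains, so raising or lowering `min_cardiac` only changes the stringency of the filter.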