College of Information Engineering, Northwest A&F University, Yangling, 712100, Shaanxi, China.
School of Software, Shandong University, Jinan, 250101, Shandong, China.
Brief Bioinform. 2022 Mar 10;23(2). doi: 10.1093/bib/bbac023.
Chromosomes are composed of many distinct chromatin domains, variously referred to as topological domains or topologically associating domains (TADs). These domains are stable across cell types and highly conserved across species, so they are considered the basic units of chromosome folding and an important secondary structure in chromosome organization. However, identifying TAD boundaries remains a major challenge because Hi-C experiments are costly and Hi-C data have low resolution. In this study, we propose a novel ensemble learning framework, termed StackTADB, for predicting TAD boundaries. StackTADB integrates four base classifiers: Random Forest, Logistic Regression, K-Nearest Neighbor and Support Vector Machine. A series of experiments on the data set from a previous study shows that StackTADB achieves the best performance on six metrics (AUC, Accuracy, MCC, Precision, Recall and F1 score) and outperforms existing methods. In addition, a comparison of multiple feature types shows that k-mer-based features play an essential role in predicting TAD boundaries in fruit flies. We also apply the SHapley Additive exPlanations (SHAP) framework to interpret the predictions of StackTADB and to identify why k-mer-based features are vital. The experimental results show that subsequences matching the BEAF-32 motif play a crucial role in predicting TAD boundaries. The source code is freely available at https://github.com/HaoWuLab-Bioinformatics/StackTADB and the StackTADB webserver is freely available at http://hwtad.sdu.edu.cn:8002/StackTADB.
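As an illustration of what a k-mer-based feature can look like, the following minimal Python sketch turns a DNA sequence into a vector of normalized k-mer frequencies. It is not taken from the StackTADB source code: the function name `kmer_frequencies`, the default `k = 3`, the lexicographic ordering over A, C, G, T and the handling of ambiguous bases are all assumptions made for this example. A vector of this kind is the sort of input that could be passed to base classifiers such as those listed above.

```python
from collections import Counter
from itertools import product
from typing import List


def kmer_frequencies(sequence: str, k: int = 3) -> List[float]:
    """Return normalized k-mer frequencies for a DNA sequence.

    Hypothetical helper for illustration only; the feature encoding
    actually used by StackTADB may differ.
    """
    sequence = sequence.upper()
    # All 4**k possible k-mers over ACGT, in lexicographic order.
    vocabulary = ["".join(p) for p in product("ACGT", repeat=k)]
    # Count every overlapping window of length k.
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    # Windows containing characters outside ACGT (e.g. N) are ignored.
    total = sum(counts[kmer] for kmer in vocabulary)
    if total == 0:
        return [0.0] * len(vocabulary)
    return [counts[kmer] / total for kmer in vocabulary]


if __name__ == "__main__":
    features = kmer_frequencies("ACGTACGT", k=2)
    print(len(features))  # 16 possible 2-mers
```

Normalizing by the number of valid windows keeps the feature values comparable across sequences of different lengths, which is one common design choice among several.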