Jing Fang, Zhang Shao-Wu, Cao Zhen, Zhang Shihua
IEEE/ACM Trans Comput Biol Bioinform. 2021 Jan-Feb;18(1):355-364. doi: 10.1109/TCBB.2019.2901789. Epub 2021 Feb 3.
Knowing transcription factor binding sites (TFBSs) is essential for modeling the underlying binding mechanisms and downstream cellular functions. Convolutional neural networks (CNNs) have outperformed other methods in predicting TFBSs from the primary DNA sequence. In addition to the DNA sequence, histone modifications and chromatin accessibility are important factors influencing transcription factor activity, and both have recently been explored for predicting TFBSs. However, current methods rarely incorporate histone modifications and chromatin accessibility into a CNN within an integrative framework. To this end, we developed a general CNN model that integrates these data for predicting TFBSs. We systematically benchmarked a series of architecture variants by changing the width and depth of the network, and explored the effect of sample length by varying the flanking regions. We evaluated the performance of the three types of data and their combinations using 256 ChIP-seq experiments, and compared the model with competing machine learning methods. We found that the contributions of the three types of data are complementary to each other. Moreover, the integrative CNN framework significantly outperforms traditional machine learning methods.
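One way to picture the integration described above is to treat histone modification and chromatin accessibility signals as extra input channels alongside the one-hot encoded DNA sequence, so that a single convolutional filter can weigh all three data types at once. The abstract does not give the actual architecture, so the following is only a minimal sketch under that assumption. The function names (`one_hot`, `build_input`, `conv1d_relu_maxpool`), the channel layout, and the toy values are all hypothetical, and a real model would use a deep learning framework with many learned filters rather than one hand-written filter.

```python
# Hypothetical sketch: names, channel layout, and toy values are assumptions,
# not the authors' implementation.

BASES = "ACGT"


def one_hot(seq):
    """Return 4 channels (A, C, G, T), each a list of 0.0/1.0 per position.

    Any other character (e.g. N) yields 0.0 in all four channels.
    """
    seq = seq.upper()
    return [[1.0 if base == b else 0.0 for base in seq] for b in BASES]


def build_input(seq, histone_tracks, accessibility):
    """Stack sequence, histone modification, and accessibility channels.

    histone_tracks: list of per-position signal lists, one per histone mark.
    accessibility:  one per-position signal list (e.g. read coverage).
    Returns a list of channels, each with len(seq) values.
    """
    channels = one_hot(seq) + [list(t) for t in histone_tracks] + [list(accessibility)]
    if any(len(c) != len(seq) for c in channels):
        raise ValueError("every track must have the same length as the sequence")
    return channels


def conv1d_relu_maxpool(channels, kernel):
    """Scan one filter over the input, apply ReLU, then global max pooling.

    kernel[c][k] is the weight for channel c at offset k within the window.
    """
    length, width = len(channels[0]), len(kernel[0])
    if len(kernel) != len(channels) or width > length:
        raise ValueError("kernel shape does not match the input")
    best = 0.0  # starting at 0.0 makes max() equal to ReLU followed by max pooling
    for start in range(length - width + 1):
        activation = sum(
            kernel[c][k] * channels[c][start + k]
            for c in range(len(channels))
            for k in range(width)
        )
        best = max(best, activation)
    return best


# Toy example: a 6 bp sample with one histone mark and one accessibility track.
x = build_input(
    "ACGTAC",
    histone_tracks=[[0, 0, 1, 1, 0, 0]],
    accessibility=[0.5] * 6,
)

# A width-2 filter that responds to "GT" and to the histone signal beneath it.
gt_filter = [
    [0.0, 0.0],  # A
    [0.0, 0.0],  # C
    [1.0, 0.0],  # G at offset 0
    [0.0, 1.0],  # T at offset 1
    [0.5, 0.5],  # histone mark
    [0.0, 0.0],  # accessibility (ignored by this filter)
]
score = conv1d_relu_maxpool(x, gt_filter)
```

In this layout, comparing the three data types and their combinations amounts to including or dropping the corresponding channels in `build_input`, and changing the length of the flanking regions changes only the length of each channel.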