IEEE/ACM Trans Comput Biol Bioinform. 2020 Mar-Apr;17(2):679-689. doi: 10.1109/TCBB.2018.2864203. Epub 2018 Aug 7.
Although convolutional neural networks (CNNs) have outperformed conventional methods in predicting the sequence specificities of protein-DNA binding in recent years, they do not take full advantage of the intrinsic weakly supervised information in DNA sequences, namely that a bound sequence may contain multiple transcription factor binding sites (TFBSs). Here, we propose a weakly supervised convolutional neural network architecture (WSCNN) that combines multiple-instance learning (MIL) with a CNN to further improve the prediction of protein-DNA binding. WSCNN first divides each DNA sequence (a bag) into multiple overlapping subsequences (instances) with a sliding window, then models each instance separately with a CNN, and finally fuses the predicted scores of all instances in the same bag using four fusion methods: Max, Average, Linear Regression, and Top-Bottom Instances. Experimental results on in vivo and in vitro datasets illustrate the performance of the proposed approach. Moreover, models built on in vitro data using WSCNN can predict in vivo protein-DNA binding with good accuracy. In addition, through a series of experiments, we quantitatively analyze the importance of the reverse-complement mode in predicting in vivo protein-DNA binding and explain why we do not directly use advanced pooling layers to combine MIL with a CNN.
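The bag-and-instance pipeline described above can be illustrated with a minimal sketch. It is not the authors' implementation. The function names, the window and stride values, and `toy_score` (a GC-fraction stand-in for the per-instance CNN) are all hypothetical. The exact form of Top-Bottom Instances is not given in the abstract, so the version here (mean of the k highest scores plus `alpha` times the mean of the k lowest) is an assumption, and the Linear Regression weights would be learned in practice rather than supplied by hand.

```python
def sliding_instances(seq, window, stride):
    """Split one sequence (bag) into overlapping subsequences (instances)."""
    if len(seq) <= window:
        return [seq]
    return [seq[i:i + window] for i in range(0, len(seq) - window + 1, stride)]


def toy_score(instance):
    """Hypothetical stand-in for the per-instance CNN: GC fraction in [0, 1]."""
    return sum(base in "GC" for base in instance) / len(instance)


def fuse_max(scores):
    """Bag score is the highest instance score."""
    return max(scores)


def fuse_average(scores):
    """Bag score is the mean of all instance scores."""
    return sum(scores) / len(scores)


def fuse_linear(scores, weights, bias=0.0):
    """Weighted sum of instance scores; weights are placeholders here."""
    return sum(w * s for w, s in zip(weights, scores)) + bias


def fuse_top_bottom(scores, k=1, alpha=1.0):
    """Assumed form: mean of the k highest scores plus alpha * mean of the k lowest."""
    ordered = sorted(scores)
    top = ordered[-k:]
    bottom = ordered[:k]
    return sum(top) / len(top) + alpha * sum(bottom) / len(bottom)


# Example: one 12-bp bag, window of 4, stride of 2 (illustrative values only).
bag = sliding_instances("ACGTGGCCATAT", window=4, stride=2)
scores = [toy_score(inst) for inst in bag]
print(bag)
print(fuse_max(scores), fuse_average(scores), fuse_top_bottom(scores))
```

Because the label is attached to the whole sequence rather than to any single window, the fusion step is what turns per-instance scores into one bag-level prediction. Max reacts to the single strongest instance, whereas Average and the weighted sum let several weaker instances contribute.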