MOE Key Laboratory of Information Fusion Technology, School of Automation, Northwestern Polytechnical University, Xi'an 710072, China.
Methods. 2022 Jul;203:207-213. doi: 10.1016/j.ymeth.2022.04.010. Epub 2022 Apr 21.
With the accumulation of ChIP-seq data, convolutional neural network (CNN)-based methods have been proposed for predicting transcription factor binding sites (TFBSs). However, biological experimental data are noisy, yet they are often treated as ground truth for both training and testing. In particular, existing classification methods ignore the false positives and false negatives caused by errors in the peak-calling stage, and can therefore easily overfit to biased training data. This leads to inaccurate identification and an inability to reveal the rules governing protein-DNA binding. To address this issue, we propose a meta-learning-based CNN method (TFBS_MLCNN, or MLCNN for short) that suppresses the influence of noisy labels and accurately recognizes TFBSs from ChIP-seq data. Guided by a small amount of unbiased meta-data, MLCNN adaptively learns an explicit weighting function from ChIP-seq data while simultaneously updating the classifier's parameters. The weighting function counteracts the influence of biased training data on the classifier by assigning each sample a weight according to its training loss. Experimental results on 424 ChIP-seq datasets show that MLCNN not only outperforms existing state-of-the-art CNN methods but also detects noisy samples, which it suppresses by assigning them small weights. This suppression of noisy samples can be seen by visualizing the sample weights. Several case studies demonstrate that MLCNN performs better than the other methods.
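The loss-based weighting described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the one-hidden-layer form of the weighting function, its hand-set parameters (`w1`, `b1`, `w2`, `b2`), the normalization by the sum of weights, and the toy predictions are all assumptions made for illustration. In the method itself the weighting function would be learned under the guidance of the unbiased meta-data rather than fixed by hand.

```python
import numpy as np

def binary_cross_entropy(probs, labels, eps=1e-12):
    """Per-sample binary cross-entropy (no averaging)."""
    probs = np.clip(probs, eps, 1.0 - eps)
    return -(labels * np.log(probs) + (1.0 - labels) * np.log(1.0 - probs))

def weight_from_loss(losses, w1, b1, w2, b2):
    """Map each sample's loss to a weight in (0, 1).

    Assumed form: a one-hidden-layer network with ReLU units and a
    sigmoid output, applied independently to each scalar loss.
    """
    hidden = np.maximum(0.0, np.outer(losses, w1) + b1)  # shape (n, h)
    logits = hidden @ w2 + b2                            # shape (n,)
    return 1.0 / (1.0 + np.exp(-logits))

def weighted_loss(losses, weights):
    """Weighted mean of the per-sample losses (normalization is an assumption)."""
    return float(np.sum(weights * losses) / np.sum(weights))

# Hypothetical, hand-set parameters chosen so that the weight decreases
# as the loss grows; a learned function need not have these values.
w1 = np.array([1.0, 1.0])
b1 = np.array([0.0, -1.0])
w2 = np.array([-1.0, -2.0])
b2 = 2.0

# Toy classifier outputs: the third sample is labelled positive but
# predicted as strongly negative, as a mislabelled peak might be.
probs = np.array([0.9, 0.8, 0.05, 0.2])
labels = np.array([1.0, 1.0, 1.0, 0.0])

losses = binary_cross_entropy(probs, labels)
weights = weight_from_loss(losses, w1, b1, w2, b2)

print("losses :", np.round(losses, 3))
print("weights:", np.round(weights, 3))
print("plain mean loss   :", round(float(losses.mean()), 3))
print("weighted mean loss:", round(weighted_loss(losses, weights), 3))
```

With these assumed parameters, the high-loss sample receives the smallest weight and contributes little to the weighted objective, which is the suppression behaviour the abstract attributes to the learned weighting function.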