Ge Wanwan, Meier Markus, Roth Christian, Söding Johannes
Quantitative and Computational Biology, Max Planck Institute for Biophysical Chemistry, Am Fassberg 11, 37077 Göttingen, Germany.
NAR Genom Bioinform. 2021 Apr 20;3(2):lqab026. doi: 10.1093/nargab/lqab026. eCollection 2021 Jun.
Transcription factors (TFs) regulate gene expression by binding to specific DNA motifs. Accurate models for predicting binding affinities are crucial for a quantitative understanding of transcriptional regulation. Motifs are commonly described by position weight matrices, which assume that each position contributes independently to the binding energy. Models that can learn dependencies between positions, for instance those induced by DNA structure preferences, have yielded markedly improved predictions for most TFs on data. However, they are more prone to overfit the data and to learn patterns that are merely correlated with TF binding rather than directly involved in it. We present an improved, faster version of our Bayesian Markov model software, BaMMmotif2. We benchmarked it against state-of-the-art motif discovery tools on a large collection of ChIP-seq and HT-SELEX datasets. Fifth-order BaMMmotif2 models achieved a median false-discovery-rate-averaged recall 13.6% and 12.2% higher than the next-best tool on 427 ChIP-seq datasets and 164 HT-SELEX datasets, respectively, while being 8 to 1000 times faster. BaMMmotif2 models showed no signs of overtraining in cross-cell-line and cross-platform tests, with similar improvements over the next-best tool. These results demonstrate that dependencies beyond first order clearly improve binding models for most TFs.
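The independence assumption behind position weight matrices can be illustrated with a minimal Python sketch. It is not BaMMmotif2 code: the three-position motif, its probabilities, the uniform background, and the function name `pwm_log_odds` are all hypothetical values chosen for illustration. The point is only that the score of a site is a plain sum of per-position terms, so no term can depend on a neighbouring base.

```python
import math

# Hypothetical 3-position motif: per-position base probabilities.
# These numbers are illustrative assumptions, not values from the paper.
PWM = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
]

# Assumed uniform background model (zeroth order).
BACKGROUND = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}


def pwm_log_odds(site, pwm=PWM, background=BACKGROUND):
    """Return the log2-odds score of `site` under the PWM versus the background.

    Each position contributes one independent term; the total is their sum.
    """
    if len(site) != len(pwm):
        raise ValueError("site length must equal motif length")
    return sum(
        math.log2(column[base] / background[base])
        for column, base in zip(pwm, site)
    )


if __name__ == "__main__":
    for candidate in ("AGC", "AGT", "TTT"):
        print(candidate, round(pwm_log_odds(candidate), 3))
```

A model with dependencies between positions would replace each per-position term with a probability conditioned on one or more preceding bases, which is what allows it to capture effects that this additive score cannot.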