CAS Key Laboratory of Computational Biology, Shanghai Institute of Nutrition and Health, University of Chinese Academy of Sciences, Chinese Academy of Sciences, 320 Yueyang Road, Shanghai, 200031, China.
Brief Bioinform. 2024 Jan 22;25(2). doi: 10.1093/bib/bbae110.
Non-coding variants associated with complex traits can alter the motifs involved in transcription factor (TF)-DNA binding. Although many computational models have been developed to predict the effects of non-coding variants on TF binding, their predictive power has not been systematically evaluated. Here we evaluate 14 models built on position weight matrices (PWMs), support vector machines, ordinary least squares and deep neural networks (DNNs), using large-scale in vitro (i.e. SNP-SELEX) and in vivo (i.e. allele-specific binding, ASB) TF binding data. Our results show that every model predicts SNP effects significantly more accurately in vitro than in vivo. For in vitro variant impact prediction, kmer/gkm-based machine learning methods (deltaSVM_HT-SELEX, QBiC-Pred) trained on in vitro datasets perform best. For in vivo ASB variant prediction, DNN-based multitask models (DeepSEA, Sei, Enformer) trained on ChIP-seq datasets perform relatively better. Among the PWM-based methods, tRap performs better than the others in both the in vitro and in vivo evaluations. In addition, we find that some TF classes, such as basic leucine zipper factors, are predicted more accurately, whereas others, such as C2H2 zinc finger factors, are predicted less accurately, in line with the evolutionary conservation of these TF classes. We also underscore the importance of non-sequence factors, such as cis-regulatory element type, TF expression, interactions and post-translational modifications, in shaping in vivo predictive performance for TFs. Our research provides valuable insights into selecting prioritization methods for non-coding variants and into further optimizing such models.
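As a rough illustration of what a PWM-based variant effect score can look like, the sketch below scores the reference and alternate alleles with a log-odds PWM and reports their difference. It is a minimal sketch under stated assumptions, not the method of tRap or of any other model evaluated here: the function names, the toy count matrix, the uniform background, the pseudocount and the forward-strand-only scan are all hypothetical choices made for illustration.

```python
import math

BASES = "ACGT"


def log_odds_pwm(counts, pseudocount=1.0, background=0.25):
    """Convert per-position base counts into a log2-odds PWM.

    Assumes a uniform background frequency (an illustrative choice).
    """
    pwm = []
    for column in counts:
        total = sum(column[b] for b in BASES) + 4 * pseudocount
        pwm.append({
            b: math.log2((column[b] + pseudocount) / total / background)
            for b in BASES
        })
    return pwm


def best_window_score(seq, pwm, snp_index):
    """Best PWM score over all forward-strand windows covering snp_index."""
    width = len(pwm)
    first = max(0, snp_index - width + 1)
    last = min(snp_index, len(seq) - width)
    best = -math.inf
    for start in range(first, last + 1):
        score = sum(pwm[i][seq[start + i]] for i in range(width))
        best = max(best, score)
    return best


def pwm_delta(ref_seq, snp_index, alt_base, pwm):
    """Alternate-minus-reference score; negative means predicted weaker binding."""
    alt_seq = ref_seq[:snp_index] + alt_base + ref_seq[snp_index + 1:]
    return (best_window_score(alt_seq, pwm, snp_index)
            - best_window_score(ref_seq, pwm, snp_index))


# Hypothetical 4-bp motif with consensus TGAC (toy counts, not real data).
toy_counts = [
    {"A": 0, "C": 0, "G": 0, "T": 10},
    {"A": 0, "C": 0, "G": 10, "T": 0},
    {"A": 10, "C": 0, "G": 0, "T": 0},
    {"A": 0, "C": 10, "G": 0, "T": 0},
]
toy_pwm = log_odds_pwm(toy_counts)

# G>C substitution at index 3 disrupts the consensus site in AATGACAA.
delta = pwm_delta("AATGACAA", 3, "C", toy_pwm)
print(f"delta score: {delta:.3f}")
```

A real pipeline would typically also scan the reverse strand and use a genome-derived background; the learned models named in the abstract (gkm-SVM and DNN approaches) replace this hand-built matrix score with trained predictors.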