Explainable AI in Genomics: Transcription Factor Binding Site Prediction with Mixture of Experts.

Authors

Tripathi Aakash, Nielsen Ian E, Umer Muhammad, Ramachandran Ravi P, Rasool Ghulam

Affiliations

Machine Learning, Moffitt Cancer Center, 12902 USF Magnolia Drive, Tampa, FL, 33612, USA.

Department of Electrical and Computer Engineering, Rowan University, Glassboro, NJ, 08028.

Publication information

ArXiv. 2025 Jul 18:arXiv:2507.09754v2.

PMID:40709306

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC12288655/

Abstract

Transcription Factor Binding Site (TFBS) prediction is crucial for understanding gene regulation and various biological processes. This study introduces a novel Mixture of Experts (MoE) approach for TFBS prediction, integrating multiple pre-trained Convolutional Neural Network (CNN) models, each specializing in different TFBS patterns. We evaluate the performance of our MoE model against individual expert models on both in-distribution and out-of-distribution (OOD) datasets, using six randomly selected transcription factors (TFs) for OOD testing. Our results demonstrate that the MoE model achieves competitive or superior performance across diverse TF binding sites, particularly excelling in OOD scenarios. The Analysis of Variance (ANOVA) statistical test confirms the significance of these performance differences. Additionally, we introduce ShiftSmooth, a novel attribution mapping technique that provides more robust model interpretability by considering small shifts in input sequences. Through comprehensive explainability analysis, we show that ShiftSmooth offers superior attribution for motif discovery and localization compared to traditional Vanilla Gradient methods. Our work presents an efficient, generalizable, and interpretable solution for TFBS prediction, potentially enabling new discoveries in genome biology and advancing our understanding of transcriptional regulation.
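The abstract describes ShiftSmooth only as an attribution technique that considers small shifts in the input sequence. The sketch below illustrates that general idea, not the authors' implementation. It assumes a toy scorer (a log-sum-exp over a strided motif filter), circular shifts, and gradient-at-observed-base attributions. Every name here (`one_hot`, `score_and_grad`, `vanilla_attribution`, `shift_averaged_attribution`, `max_shift`, `stride`) is hypothetical and chosen for illustration.

```python
import numpy as np

BASES = "ACGT"

def one_hot(seq):
    """Encode a DNA string as an (L, 4) one-hot array."""
    idx = [BASES.index(b) for b in seq]
    return np.eye(4)[idx]

def score_and_grad(x, w, stride=3):
    """Toy scorer (an assumption, not the paper's CNN).

    Applies a motif filter w of shape (k, 4) at every `stride`-th position,
    pools the activations with log-sum-exp, and returns the score together
    with its analytical gradient with respect to x.
    """
    k = w.shape[0]
    starts = range(0, x.shape[0] - k + 1, stride)
    acts = np.array([np.sum(x[p:p + k] * w) for p in starts])
    m = acts.max()
    weights = np.exp(acts - m)
    score = m + np.log(weights.sum())
    weights /= weights.sum()          # softmax over window activations
    grad = np.zeros_like(x)
    for p, a in zip(starts, weights):
        grad[p:p + k] += a * w        # d(score)/dx through each window
    return score, grad

def vanilla_attribution(x, w):
    """Per-position gradient, read off at the observed base."""
    _, g = score_and_grad(x, w)
    return (g * x).sum(axis=1)

def shift_averaged_attribution(x, w, max_shift=2):
    """Average attributions over small circular shifts of the input.

    Each shifted map is rolled back so positions line up with the
    original sequence before averaging. Circular shifting is a
    simplification made for this sketch.
    """
    total = np.zeros(x.shape[0])
    for s in range(-max_shift, max_shift + 1):
        shifted = np.roll(x, s, axis=0)
        _, g = score_and_grad(shifted, w)
        attr = (g * shifted).sum(axis=1)
        total += np.roll(attr, -s)    # undo the shift
    return total / (2 * max_shift + 1)

# Toy input: one motif occurrence at positions 11..17 in a poly-A background.
motif = "TGACTCA"
seq = "A" * 11 + motif + "A" * 12
x = one_hot(seq)
w = np.where(one_hot(motif) == 1, 1.5, -0.5)   # PWM-like filter (illustrative values)

vanilla = vanilla_attribution(x, w)
smoothed = shift_averaged_attribution(x, w, max_shift=2)
print(np.round(vanilla, 2))
print(np.round(smoothed, 2))
```

In this toy setup the strided filter never lines up exactly with the motif in the unshifted input, so the plain gradient map depends on an arbitrary alignment. Averaging over a few shifts includes alignments where the filter does match, which is one plausible reason a shift-aware attribution could localize motifs more robustly.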
