Alexander Sasse, Maria Chikina, Sara Mostafavi
Paul G. Allen School of Computer Science and Engineering, University of Washington, Seattle, WA 98195, USA.
Department of Computational and Systems Biology, University of Pittsburgh, Pittsburgh, PA 16354, USA.
iScience. 2024 Aug 23;27(9):110807. doi: 10.1016/j.isci.2024.110807. eCollection 2024 Sep 20.
To understand the decision process of genomic sequence-to-function models, explainable AI algorithms determine the importance of each nucleotide in a given input sequence to the model's predictions and enable the discovery of cis-regulatory motifs involved in gene regulation. The most commonly applied method is in silico saturation mutagenesis (ISM), because its per-nucleotide importance scores can be intuitively understood as the computational counterpart to saturation mutagenesis experiments. While ISM is highly interpretable, it requires one model evaluation for every possible single-nucleotide substitution, so it is computationally expensive for many sequences and becomes prohibitive as the length of the input sequences and the size of the model grow. Here, we use the first-order Taylor approximation to approximate ISM values from the model's gradient, which reduces the computational cost to a single forward pass and gradient computation for an input sequence. We show that the Taylor ISM (TISM) approximation is robust across different model ablations, random initializations, training parameters, and dataset sizes.
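The relationship between ISM and its first-order Taylor approximation can be illustrated with a minimal sketch. The toy model below (a softplus over a position-weight sum), its analytic gradient, and all function names are hypothetical stand-ins chosen for illustration, not the models or code used in the study. The sketch assumes a one-hot encoded input and approximates the effect of substituting base b at position l as the gradient at (l, b) minus the gradient at the reference base of position l.

```python
import numpy as np

def one_hot(seq, alphabet="ACGT"):
    """Encode a DNA string as an (L, 4) one-hot matrix."""
    x = np.zeros((len(seq), len(alphabet)))
    x[np.arange(len(seq)), [alphabet.index(c) for c in seq]] = 1.0
    return x

def toy_model(x, W):
    """Hypothetical scalar model: softplus of a position-weight sum."""
    s = np.sum(W * x)
    return np.log1p(np.exp(s))

def toy_gradient(x, W):
    """Analytic gradient of toy_model with respect to the input x."""
    s = np.sum(W * x)
    return W / (1.0 + np.exp(-s))  # sigmoid(s) * W

def ism(x, W):
    """Exact ISM: one model evaluation per single-nucleotide substitution."""
    ref = toy_model(x, W)
    scores = np.zeros_like(x)
    for l in range(x.shape[0]):
        for b in range(x.shape[1]):
            mutated = x.copy()
            mutated[l] = 0.0
            mutated[l, b] = 1.0
            scores[l, b] = toy_model(mutated, W) - ref
    return scores

def tism(x, W):
    """First-order Taylor approximation of ISM from a single gradient."""
    grad = toy_gradient(x, W)
    # Gradient at the reference base of each position, shape (L, 1).
    ref_grad = np.sum(grad * x, axis=1, keepdims=True)
    return grad - ref_grad

rng = np.random.default_rng(0)
x = one_hot("ACGTTGCAAC")
W = 0.05 * rng.standard_normal(x.shape)  # small weights: near-linear regime

ism_scores = ism(x, W)
tism_scores = tism(x, W)
pearson = np.corrcoef(ism_scores.ravel(), tism_scores.ravel())[0, 1]
print("Pearson correlation between ISM and TISM:", pearson)
```

In this sketch, exact ISM needs one model evaluation per position and base, whereas the approximation needs only one gradient. The two agree closely here because the toy model is nearly linear around the input; for a strongly nonlinear model the agreement would be expected to be looser.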