Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, New York, United States of America.
Department of Biostatistics, Harvard University, Cambridge, Massachusetts, United States of America.
PLoS Comput Biol. 2021 May 13;17(5):e1008925. doi: 10.1371/journal.pcbi.1008925. eCollection 2021 May.
Deep neural networks (DNNs) have demonstrated improved performance at predicting the sequence specificities of DNA- and RNA-binding proteins compared to previous methods that rely on k-mers and position weight matrices. To gain insight into why a DNN makes a given prediction, model interpretability methods, such as attribution methods, can be employed to identify motif-like representations along a given sequence. Because explanations are given on an individual-sequence basis and can vary substantially across sequences, deducing generalizable trends across the dataset and quantifying their effect size remains a challenge. Here we introduce global importance analysis (GIA), a model interpretability method that quantifies the population-level effect size that putative patterns have on model predictions. GIA provides an avenue to quantitatively test hypotheses about putative patterns and their interactions with other patterns, as well as to map out specific functions the network has learned. As a case study, we demonstrate the utility of GIA on the computational task of predicting RNA-protein interactions from sequence. We first introduce a convolutional network, which we call ResidualBind, and benchmark its performance against previous methods on RNAcompete data. Using GIA, we then demonstrate that, in addition to sequence motifs, ResidualBind learns a model that considers the number of motifs, their spacing, and sequence context, such as RNA secondary structure and GC-bias.
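The population-level effect size described above can be illustrated with a minimal sketch. It rests on assumptions not stated in the abstract: the effect is estimated as the mean prediction over background sequences with a pattern embedded, minus the mean prediction over the same backgrounds without it; backgrounds are drawn uniformly at random; and `toy_model` is a hypothetical motif counter standing in for a trained network such as ResidualBind. All function names here are illustrative.

```python
import numpy as np

ALPHABET = "ACGU"

def one_hot(seq):
    """Encode an RNA string as an (L, 4) one-hot array."""
    idx = np.array([ALPHABET.index(c) for c in seq])
    return np.eye(4)[idx]

def random_backgrounds(n, length, rng):
    """Sample n background sequences uniformly at random (an assumed null model)."""
    idx = rng.integers(0, 4, size=(n, length))
    return np.eye(4)[idx]  # shape (n, length, 4)

def embed(backgrounds, pattern, position):
    """Copy the backgrounds and overwrite a fixed window with the pattern."""
    out = backgrounds.copy()
    out[:, position:position + len(pattern), :] = one_hot(pattern)
    return out

def global_importance(model, backgrounds, pattern, position):
    """Mean prediction with the pattern embedded minus mean prediction without it."""
    with_pattern = model(embed(backgrounds, pattern, position))
    without_pattern = model(backgrounds)
    return float(np.mean(with_pattern) - np.mean(without_pattern))

def toy_model(x, motif="UGCAUG"):
    """Hypothetical stand-in for a trained network: counts exact motif matches."""
    m = one_hot(motif)
    k = len(motif)
    # Sliding windows over each sequence: shape (n, L - k + 1, k, 4).
    windows = np.stack([x[:, i:i + k, :] for i in range(x.shape[1] - k + 1)], axis=1)
    matches = (windows * m).sum(axis=(2, 3)) == k
    return matches.sum(axis=1).astype(float)

rng = np.random.default_rng(0)
bg = random_backgrounds(1000, 41, rng)

# Compare the putative motif against a single-nucleotide variant.
effect_motif = global_importance(toy_model, bg, "UGCAUG", 18)
effect_mut = global_importance(toy_model, bg, "UGAAUG", 18)
print(f"motif: {effect_motif:.3f}  variant: {effect_mut:.3f}")
```

In this sketch the same function could be reused to probe motif number or spacing by embedding two copies of a pattern at varying positions, but how closely that matches the paper's procedure is not something the abstract specifies.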