Linder Johannes, La Fleur Alyssa, Chen Zibo, Ljubetič Ajasja, Baker David, Kannan Sreeram, Seelig Georg
Paul G. Allen School of Computer Science and Engineering, University of Washington.
Institute for Protein Design, University of Washington.
Nat Mach Intell. 2022 Jan;4(1):41-54. doi: 10.1038/s42256-021-00428-6. Epub 2022 Jan 25.
Sequence-based neural networks can learn to make accurate predictions from large biological datasets, but model interpretation remains challenging. Many existing feature attribution methods are optimized for continuous rather than discrete input patterns and assess individual feature importance in isolation, making them ill-suited for interpreting non-linear interactions in molecular sequences. Building on work in computer vision and natural language processing, we developed a deep learning approach, Scrambler networks, in which the most salient sequence positions are identified with learned input masks. Scramblers learn to predict Position-Specific Scoring Matrices (PSSMs) where unimportant nucleotides or residues are scrambled by raising their entropy. We apply Scramblers to interpret the effects of genetic variants, uncover non-linear interactions between cis-regulatory elements, explain binding specificity for protein-protein interactions, and identify structural determinants of designed proteins. We show that Scramblers enable efficient attribution across large datasets and result in high-quality explanations, often outperforming state-of-the-art methods.
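The idea of scrambling unimportant positions by raising their entropy can be illustrated with a small NumPy sketch. This is not the authors' implementation: the mixing rule (a softmax over log-background plus a per-position score times the one-hot input), the uniform background, the hand-picked scores, and the names `one_hot`, `scramble`, and `entropy_bits` are all assumptions made for illustration. In a Scrambler network, the per-position importance would be predicted by a trained network rather than set by hand.

```python
import numpy as np

ALPHABET = "ACGT"  # assumed nucleotide alphabet for this toy example

def one_hot(seq):
    """Encode a DNA string as an (L, 4) one-hot matrix."""
    idx = np.array([ALPHABET.index(c) for c in seq])
    return np.eye(len(ALPHABET))[idx]

def scramble(x, scores, background):
    """Turn a one-hot sequence into a PSSM (illustrative rule, an assumption).

    A score of 0 leaves a position at the background distribution
    (maximally scrambled); a large score keeps it close to one-hot.
    """
    logits = np.log(background)[None, :] + scores[:, None] * x
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)

def entropy_bits(pssm):
    """Per-position Shannon entropy of a PSSM, in bits."""
    safe = np.clip(pssm, 1e-12, 1.0)  # avoid log(0)
    return -(pssm * np.log2(safe)).sum(axis=1)

x = one_hot("ACGTAC")
background = np.full(4, 0.25)                      # assumed uniform background
scores = np.array([0.0, 0.0, 8.0, 8.0, 0.0, 0.0])  # hypothetical importance scores

pssm = scramble(x, scores, background)
ent = entropy_bits(pssm)

for base, e in zip("ACGTAC", ent):
    print(f"{base}: entropy = {e:.3f} bits")
```

In this toy setting the two high-scoring positions stay nearly deterministic while the remaining positions fall back to the background distribution, which is the low-entropy versus high-entropy contrast the abstract describes.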