Critiquing Protein Family Classification Models Using Sufficient Input Subsets.

Authors

Carter Brandon, Bileschi Maxwell, Smith Jamie, Sanderson Theo, Bryant Drew, Belanger David, Colwell Lucy J

Affiliations

MIT Computer Science and Artificial Intelligence Laboratory, Cambridge, Massachusetts, USA.

Google Research, Mountain View, California, USA.

Publication information

J Comput Biol. 2020 Aug;27(8):1219-1231. doi: 10.1089/cmb.2019.0339. Epub 2019 Dec 23.

Abstract

In many application domains, neural networks are highly accurate and have been deployed at large scale. However, users often do not have good tools for understanding how these models arrive at their predictions. This has hindered adoption in fields such as the life and medical sciences, where researchers require that models base their decisions on underlying biological phenomena rather than peculiarities of the dataset. We propose a set of methods for critiquing deep learning models and demonstrate their application for protein family classification, a task for which high-accuracy models have considerable potential impact. Our methods extend the Sufficient Input Subsets (SIS) technique, which we use to identify subsets of features in each protein sequence that are alone sufficient for classification. Our suite of tools analyzes these subsets to shed light on the decision-making criteria employed by models trained on this task. These tools show that while deep models may perform classification for biologically relevant reasons, their behavior varies considerably across the choice of network architecture and parameter initialization. While the techniques that we develop are specific to the protein sequence classification task, the approach taken generalizes to a broad set of scientific contexts in which model interpretability is essential.
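The core idea named in the abstract, finding a subset of sequence positions that is alone sufficient for a confident classification, can be illustrated with a minimal sketch. This is not the authors' implementation or their extended tooling: it is a simplified version of the general SIS backward-selection idea that extracts a single subset. The mask token `X`, the function names, and the `toy_score` stand-in for a trained classifier are all hypothetical and chosen only for illustration.

```python
from typing import Callable, List, Sequence, Set

MASK = "X"  # hypothetical mask token standing in for an "unknown" residue


def masked(seq: Sequence[str], keep: Set[int]) -> List[str]:
    """Copy seq, replacing every position not in `keep` with MASK."""
    return [c if i in keep else MASK for i, c in enumerate(seq)]


def back_select(score: Callable[[List[str]], float],
                seq: Sequence[str]) -> List[int]:
    """Greedily mask one position at a time, always choosing the position
    whose masking leaves the score highest. Returns the removal order."""
    keep = set(range(len(seq)))
    order = []
    while keep:
        best = max(sorted(keep),
                   key=lambda i: score(masked(seq, keep - {i})))
        keep.remove(best)
        order.append(best)
    return order


def find_sis(score: Callable[[List[str]], float],
             seq: Sequence[str],
             threshold: float) -> List[int]:
    """Return one subset of positions whose residues alone (all others
    masked) keep the score at or above `threshold`, or [] if the full
    sequence itself does not reach the threshold."""
    if score(list(seq)) < threshold:
        return []
    subset: Set[int] = set()
    # Add positions back, last-removed first, until the score is sufficient.
    for pos in reversed(back_select(score, seq)):
        subset.add(pos)
        if score(masked(seq, subset)) >= threshold:
            break
    return sorted(subset)


def toy_score(seq: List[str]) -> float:
    """Hypothetical stand-in for a classifier's confidence in one family:
    it only looks for 'C' at index 2 and 'H' at index 5."""
    return ((seq[2] == "C") + (seq[5] == "H")) / 2


sis = find_sis(toy_score, list("MKCLAHGT"), threshold=1.0)
print(sis)
```

On this toy input the sketch keeps only the two positions the stand-in scorer depends on. With a real model, `score` would wrap the network's predicted probability for the class of interest, and each call would cost one forward pass, so the greedy loop above needs on the order of the squared sequence length in model evaluations.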
