Critiquing Protein Family Classification Models Using Sufficient Input Subsets

Authors

Carter Brandon, Bileschi Maxwell, Smith Jamie, Sanderson Theo, Bryant Drew, Belanger David, Colwell Lucy J

Affiliations

MIT Computer Science and Artificial Intelligence Laboratory, Cambridge, Massachusetts, USA.

Google Research, Mountain View, California, USA.

Publication information

J Comput Biol. 2020 Aug;27(8):1219-1231. doi: 10.1089/cmb.2019.0339. Epub 2019 Dec 23.

DOI:10.1089/cmb.2019.0339

PMID:31874057

Abstract

In many application domains, neural networks are highly accurate and have been deployed at large scale. However, users often do not have good tools for understanding how these models arrive at their predictions. This has hindered adoption in fields such as the life and medical sciences, where researchers require that models base their decisions on underlying biological phenomena rather than peculiarities of the dataset. We propose a set of methods for critiquing deep learning models and demonstrate their application for protein family classification, a task for which high-accuracy models have considerable potential impact. Our methods extend the Sufficient Input Subsets (SIS) technique, which we use to identify subsets of features in each protein sequence that are alone sufficient for classification. Our suite of tools analyzes these subsets to shed light on the decision-making criteria employed by models trained on this task. These tools show that while deep models may perform classification for biologically relevant reasons, their behavior varies considerably across the choice of network architecture and parameter initialization. While the techniques that we develop are specific to the protein sequence classification task, the approach taken generalizes to a broad set of scientific contexts in which model interpretability is essential.
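The abstract does not spell out the SIS procedure, so the following is only a minimal sketch of the general idea: greedily mask the positions that matter least, then add positions back in reverse order until the classifier's score clears a threshold. The mask symbol `X`, the function names, and the toy scoring function are all hypothetical stand-ins for illustration; this is not the authors' implementation, and it finds a single subset rather than the full collection the method can produce.

```python
from typing import Callable, List

MASK = "X"  # hypothetical mask token; a real model would define its own

Scorer = Callable[[List[str]], float]


def back_select(seq: str, score: Scorer) -> List[int]:
    """Greedily mask positions, always choosing the one whose masking
    leaves the score highest. Returns positions in masking order."""
    current = list(seq)
    remaining = list(range(len(seq)))
    order: List[int] = []
    while remaining:
        best_pos, best_score = remaining[0], float("-inf")
        for pos in remaining:
            trial = current.copy()
            trial[pos] = MASK
            s = score(trial)
            if s > best_score:  # ties keep the earliest position
                best_pos, best_score = pos, s
        current[best_pos] = MASK
        remaining.remove(best_pos)
        order.append(best_pos)
    return order


def sufficient_subset(seq: str, score: Scorer, threshold: float) -> List[int]:
    """Return one set of positions that alone keeps score >= threshold,
    or an empty list if the unmasked sequence is already below it."""
    if score(list(seq)) < threshold:
        return []
    order = back_select(seq, score)
    masked = [MASK] * len(seq)
    kept: List[int] = []
    # Positions masked last mattered most, so restore them first.
    for pos in reversed(order):
        masked[pos] = seq[pos]
        kept.append(pos)
        if score(masked) >= threshold:
            break
    return sorted(kept)


def toy_score(tokens: List[str]) -> float:
    """Toy stand-in for a classifier: fraction of two assumed 'motif'
    residues present (C at index 2, H at index 5)."""
    hits = (tokens[2] == "C") + (tokens[5] == "H")
    return hits / 2


if __name__ == "__main__":
    print(sufficient_subset("AACAAHAA", toy_score, threshold=1.0))
```

With the toy scorer, only the two motif positions are needed to keep the score at the threshold, so every other residue ends up masked. In practice `score` would wrap a trained model's predicted probability for the family of interest, and each call would be a forward pass.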

Similar articles

Probing machine-learning classifiers using noise, bubbles, and reverse correlation.

J Neurosci Methods. 2021 Oct 1;362:109297. doi: 10.1016/j.jneumeth.2021.109297. Epub 2021 Jul 25.

Transferability of artificial neural networks for clinical document classification across hospitals: A case study on abnormality detection from radiology reports.

J Biomed Inform. 2018 Sep;85:68-79. doi: 10.1016/j.jbi.2018.07.017. Epub 2018 Jul 17.

Deep learning for electroencephalogram (EEG) classification tasks: a review.

J Neural Eng. 2019 Jun;16(3):031001. doi: 10.1088/1741-2552/ab0ab5. Epub 2019 Feb 26.

Evaluating shallow and deep learning strategies for the 2018 n2c2 shared task on clinical text classification.

J Am Med Inform Assoc. 2019 Nov 1;26(11):1247-1254. doi: 10.1093/jamia/ocz149.

Deep neural networks for human microRNA precursor detection.

BMC Bioinformatics. 2020 Jan 13;21(1):17. doi: 10.1186/s12859-020-3339-7.

A novel end-to-end classifier using domain transferred deep convolutional neural networks for biomedical images.

Comput Methods Programs Biomed. 2017 Mar;140:283-293. doi: 10.1016/j.cmpb.2016.12.019. Epub 2017 Jan 6.

Novel deep neural network based pattern field classification architectures.

Neural Netw. 2020 Jul;127:82-95. doi: 10.1016/j.neunet.2020.03.011. Epub 2020 Mar 14.

Multimodal deep representation learning for protein interaction identification and protein family classification.

BMC Bioinformatics. 2019 Dec 2;20(Suppl 16):531. doi: 10.1186/s12859-019-3084-y.

Stress detection using deep neural networks.

BMC Med Inform Decis Mak. 2020 Dec 30;20(Suppl 11):285. doi: 10.1186/s12911-020-01299-4.

Cited by

Decoding biology with massively parallel reporter assays and machine learning.

Genes Dev. 2024 Oct 16;38(17-20):843-865. doi: 10.1101/gad.351800.124.

Towards the adoption of quantitative computed tomography in the management of interstitial lung disease.

Eur Respir Rev. 2024 Mar 27;33(171). doi: 10.1183/16000617.0055-2023. Print 2024 Jan 31.

Interpreting Neural Networks for Biological Sequences by Learning Stochastic Masks.

Nat Mach Intell. 2022 Jan;4(1):41-54. doi: 10.1038/s42256-021-00428-6. Epub 2022 Jan 25.

An improved deep learning model for hierarchical classification of protein families.

PLoS One. 2021 Oct 20;16(10):e0258625. doi: 10.1371/journal.pone.0258625. eCollection 2021.

Improving protein domain classification for third-generation sequencing reads using deep learning.

BMC Genomics. 2021 Apr 9;22(1):251. doi: 10.1186/s12864-021-07468-7.

Antibody complementarity determining region design using high-capacity machine learning.

Bioinformatics. 2020 Apr 1;36(7):2126-2133. doi: 10.1093/bioinformatics/btz895.
