School of Science and Technology, Department of Computer Science, Nottingham Trent University, Nottingham, UK.
Perceptronix Ltd, Hilton, Derbyshire, UK.
Methods Mol Biol. 2022;2449:349-386. doi: 10.1007/978-1-0716-2095-3_15.
Since the advent of high-throughput omics technologies, molecular data on genes, transcripts, proteins, and metabolites have become widely available to researchers. This has given clinicians, bioinformaticians, statisticians, and data scientists the opportunity to apply their innovations in feature mining and predictive modeling to a rich data resource and to develop a wide range of generalizable prediction models. Over the last 10 years, it has become apparent that researchers have adopted deep neural networks (or "deep nets") as their preferred paradigm for complex data modeling because they outperform more traditional statistical machine learning approaches, such as support vector machines. A key stumbling block, however, is that deep nets inherently lack transparency and are considered a "black box" approach. This makes it very difficult for clinicians and other stakeholders to trust deep learning models, even when their predictions appear to be highly accurate. In this chapter, we therefore provide a detailed summary of the deep net architectures typically used in omics research, together with a comprehensive summary of the notable "deep feature mining" techniques researchers have applied to open up this black box and gain insight into the salient input features and why these models behave as they do. We group these techniques into the following three categories: (a) hidden layer visualization and interpretation; (b) input feature importance and impact evaluation; and (c) output layer gradient analysis.
While we find that omics researchers have made considerable gains in opening up the black box by interpreting hidden layer weights and node activations to identify salient input features, we also highlight other approaches open to them, such as deconvolutional network-based methods and the development of bespoke attribute impact measures. These would enable researchers to better understand the relationships between the input data and the hidden layer representations formed, and thus the output behavior of their deep nets.
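As a rough illustration of category (c), the sketch below computes the gradient of a network's output with respect to each input feature and ranks features by its magnitude. It is a minimal toy example and not a method taken from the chapter: the one-hidden-layer architecture, the random weights, the function names `forward` and `input_gradient`, and the six synthetic "features" are all assumptions made for illustration.

```python
import numpy as np

def forward(x, W1, b1, W2, b2):
    """Toy one-hidden-layer net: tanh hidden layer, sigmoid scalar output."""
    h = np.tanh(W1 @ x + b1)
    z = W2 @ h + b2
    y = 1.0 / (1.0 + np.exp(-z))
    return y, h

def input_gradient(x, W1, b1, W2, b2):
    """Gradient of the scalar output with respect to each input feature."""
    y, h = forward(x, W1, b1, W2, b2)
    dy_dz = y * (1.0 - y)      # derivative of the sigmoid output
    dh_da = 1.0 - h ** 2       # derivative of tanh at each hidden node
    # Chain rule back through the hidden layer to the inputs.
    return dy_dz * ((W2 * dh_da) @ W1)

# Hypothetical weights standing in for a trained model.
rng = np.random.default_rng(0)
n_features, n_hidden = 6, 4
W1 = rng.normal(size=(n_hidden, n_features))
W1[:, 0] = 0.0                 # feature 0 is deliberately disconnected
b1 = rng.normal(size=n_hidden)
W2 = rng.normal(size=n_hidden)
b2 = 0.1

x = rng.normal(size=n_features)  # one synthetic sample
saliency = np.abs(input_gradient(x, W1, b1, W2, b2))
ranking = np.argsort(saliency)[::-1]  # most influential feature first

print("saliency per feature:", saliency)
print("features ranked by saliency:", ranking)
```

Because feature 0 has no connection to the hidden layer, its saliency is exactly zero, which is the behavior one would expect from an uninformative input. In practice the gradient would come from an automatic differentiation framework applied to a trained model, and a single-sample gradient is only a local measure of sensitivity.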