Domjan Barić, Petar Fumić, Davor Horvatić, Tomislav Lipic
Department of Physics, Faculty of Science, University of Zagreb, Bijenička cesta 32, 10000 Zagreb, Croatia.
Division of Electronics, Ruđer Bošković Institute, Bijenička cesta 54, 10000 Zagreb, Croatia.
Entropy (Basel). 2021 Jan 25;23(2):143. doi: 10.3390/e23020143.
The adoption of deep learning models in safety-critical systems cannot rely on good prediction performance alone; the models must also provide interpretable and robust explanations for their decisions. When modeling complex sequences, attention mechanisms are regarded as the established approach for endowing deep neural networks with intrinsic interpretability. This paper focuses on the emerging trend of specifically designing diagnostic datasets for understanding the inner workings of attention-based deep learning models for multivariate forecasting tasks. We design a novel benchmark of synthetic datasets with a transparent underlying generating process of multiple interacting time series of increasing complexity. The benchmark enables empirical evaluation of attention-based deep neural networks along three aspects: (i) prediction performance, (ii) interpretability correctness, and (iii) sensitivity analysis. Our analysis shows that although most models achieve satisfactory and stable prediction performance, they often fail to provide correct interpretations. The only model with both satisfactory prediction performance and correct interpretability is IMV-LSTM, which captures both autocorrelations and cross-correlations between multiple time series. Interestingly, when IMV-LSTM is evaluated on simulated data from statistical and mechanistic models, the correctness of its interpretability increases with dataset complexity.
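To make the benchmark idea concrete, the following is a minimal Python sketch, not code from the paper: the names make_synthetic_series and interpretability_correctness are hypothetical, and the actual benchmark uses statistical and mechanistic generating models of increasing complexity. The sketch builds a multivariate series with a transparent generating process (the target depends only on one driver at a known lag, while a second series is pure noise) and scores how much of a model's attention mass falls on the truly relevant variable.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_synthetic_series(n=1000, lag=2, beta=0.8):
    """Transparent generating process: the target y depends only on
    x1 at a known lag; x2 is a known-irrelevant distractor series."""
    x1 = rng.standard_normal(n).cumsum() * 0.1   # slowly drifting driver
    x2 = rng.standard_normal(n)                  # pure-noise distractor
    y = np.zeros(n)
    for t in range(lag, n):
        y[t] = 0.5 * y[t - 1] + beta * x1[t - lag] \
               + 0.1 * rng.standard_normal()     # AR(1) term + lagged driver
    return np.stack([x1, x2], axis=1), y

def interpretability_correctness(attn, relevant=(0,)):
    """Fraction of (normalized) attention mass a model assigns to the
    variables that truly drive the target; 1.0 is perfect attribution."""
    attn = np.asarray(attn, dtype=float)
    attn = attn / attn.sum()
    return attn[list(relevant)].sum()

X, y = make_synthetic_series()
# A trained model's per-variable attention weights would be plugged in
# here; these numbers are placeholders, not real model output.
print(interpretability_correctness([0.9, 0.1]))  # -> 0.9 (mostly correct)
print(interpretability_correctness([0.3, 0.7]))  # -> 0.3 (misattributed)
```

Because the generating process is known exactly, prediction performance and interpretability correctness can be evaluated independently in this style: a model may fit y well while its attention weights still misattribute importance to the noise series.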