Huang Pan, He Peng, Tian Sukun, Ma Mingrui, Feng Peng, Xiao Hualiang, Mercaldo Francesco, Santone Antonella, Qin Jing
IEEE Trans Med Imaging. 2023 Jan;42(1):15-28. doi: 10.1109/TMI.2022.3202248. Epub 2022 Dec 29.
The tumor grading of laryngeal cancer pathological images needs to be accurate and interpretable. A deep learning model based on the attention mechanism-integrated convolution (AMC) block has good inductive bias capability but poor interpretability, whereas a model based on the vision transformer (ViT) block has good interpretability but weak inductive bias capability. Therefore, we propose an end-to-end ViT-AMC network (ViT-AMCNet) with adaptive model fusion and multiobjective optimization that integrates and fuses the ViT and AMC blocks. However, existing model fusion methods often suffer from negative fusion: (1) there is no guarantee that the ViT and AMC blocks will simultaneously have good feature representation capability; (2) the difference in feature representation learning between the ViT and AMC blocks is not obvious, so the two feature representations contain much redundant information. Accordingly, we first prove the feasibility of fusing the ViT and AMC blocks based on Hoeffding's inequality. Then, we propose a multiobjective optimization method to address the problem that the ViT and AMC blocks cannot simultaneously achieve good feature representation. Finally, we propose an adaptive model fusion method that integrates a metrics block and a fusion block to increase the differences between feature representations and improve the deredundancy capability. These methods improve the fusion ability of ViT-AMCNet, and experimental results demonstrate that ViT-AMCNet significantly outperforms state-of-the-art methods. Importantly, the visualized interpretive maps are closer to the regions of interest that pathologists focus on, and the generalization ability is also excellent. Our code is publicly available at https://github.com/Baron-Huang/ViT-AMCNet.
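For context, the feasibility argument mentioned above rests on Hoeffding's inequality. In its standard form it bounds the deviation of a sum of independent bounded random variables from its expectation; the abstract does not spell out how the bound is instantiated for the two blocks, so the concrete application is left to the paper:

```latex
% Hoeffding's inequality (standard form): for independent X_1, ..., X_n with
% X_i in [a_i, b_i], the sum S_n = X_1 + ... + X_n concentrates around its mean.
\[
P\bigl(\lvert S_n - \mathbb{E}[S_n]\rvert \ge t\bigr)
  \le 2\exp\!\left(-\frac{2t^2}{\sum_{i=1}^{n}(b_i - a_i)^2}\right),
  \qquad t > 0 .
\]
```

As a rough illustration of the two-branch fusion and multiobjective training described above, the following is a minimal sketch assuming PyTorch; the branch, head, and loss definitions are illustrative stand-ins, not the released ViT-AMCNet implementation (see the GitHub link above for that).

```python
# Minimal sketch (assumption: PyTorch). Two stand-in branches emulate the ViT and
# AMC feature extractors, a fusion head combines their features, and training
# minimizes one loss per branch plus one for the fused head -- a simple scalarized
# stand-in for the paper's multiobjective optimization.
import torch
import torch.nn as nn

class TwoBranchFusionNet(nn.Module):
    def __init__(self, feat_dim=256, num_classes=3):
        super().__init__()
        # Stand-in for the ViT branch: flatten the image and project to feat_dim.
        self.vit_branch = nn.Sequential(nn.Flatten(), nn.LazyLinear(feat_dim), nn.GELU())
        # Stand-in for the AMC branch: a small convolutional feature extractor.
        self.amc_branch = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, feat_dim), nn.ReLU(),
        )
        self.vit_head = nn.Linear(feat_dim, num_classes)
        self.amc_head = nn.Linear(feat_dim, num_classes)
        # Fusion block stand-in: classify from the concatenated features.
        self.fusion_head = nn.Linear(2 * feat_dim, num_classes)

    def forward(self, x):
        f_vit, f_amc = self.vit_branch(x), self.amc_branch(x)
        return (self.vit_head(f_vit), self.amc_head(f_amc),
                self.fusion_head(torch.cat([f_vit, f_amc], dim=1)))

model = TwoBranchFusionNet()
x, y = torch.randn(2, 3, 224, 224), torch.tensor([0, 1])
logits_vit, logits_amc, logits_fused = model(x)
ce = nn.CrossEntropyLoss()
# Weighted sum of per-branch and fused losses as a scalarized multiobjective target.
loss = ce(logits_fused, y) + 0.5 * ce(logits_vit, y) + 0.5 * ce(logits_amc, y)
loss.backward()
```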