Inception Institute of AI, Abu Dhabi, United Arab Emirates; Faculty of IT, Monash University, Melbourne, Australia.
École Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland; Lausanne University Hospital (CHUV), Lausanne, Switzerland.
Med Image Anal. 2024 Apr;93:103075. doi: 10.1016/j.media.2023.103075. Epub 2024 Jan 6.
Informative sample selection in an active learning (AL) setting helps a machine learning system attain optimum performance with a minimum number of labeled samples, thus reducing annotation costs and boosting the performance of computer-aided diagnosis systems when labeled data are limited. Another effective technique for enlarging datasets in a small labeled data regime is data augmentation. An intuitive active learning approach thus consists of combining informative sample selection and data augmentation to leverage their respective advantages and improve the performance of AL systems. In this paper, we propose a novel approach called GANDALF (Graph-based TrANsformer and Data Augmentation Active Learning Framework) to combine sample selection and data augmentation in a multi-label setting. Conventional sample selection approaches in AL have mostly focused on the single-label setting, where a sample has only one disease label. These approaches do not perform optimally when a sample can have multiple disease labels (e.g., in chest X-ray images). We improve upon state-of-the-art multi-label active learning techniques by representing disease labels as graph nodes and using graph attention transformers (GAT) to learn more effective inter-label relationships. We identify the most informative samples by aggregating GAT representations. Subsequently, we generate transformations of these informative samples by sampling from a learned latent space. From these generated samples, we identify informative samples via a novel multi-label informativeness score, which, going beyond the state of the art, ensures that generated samples (i) are not redundant with respect to the training data and (ii) make important contributions to the training stage.
We apply our method to two public chest X-ray datasets, as well as breast, dermatology, retina and kidney tissue microscopy MedMNIST datasets, and report improved results over state-of-the-art multi-label AL techniques in terms of model performance, learning rates, and robustness.
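The two-stage idea described above (pick informative samples, then keep only generated samples that are not redundant with the training data) can be illustrated with a minimal sketch. This is not the GANDALF method: the abstract does not specify the scoring function, so the sketch swaps in summed per-label binary entropy as a stand-in informativeness score and a Euclidean nearest-neighbor threshold as a stand-in redundancy check. The function names and the `min_dist` parameter are hypothetical and chosen only for illustration.

```python
import numpy as np


def multilabel_entropy(probs):
    """Sum of per-label binary entropies; probs has shape (n_samples, n_labels).

    Assumption: entropy stands in for the paper's GAT-based score.
    """
    p = np.clip(np.asarray(probs, dtype=float), 1e-7, 1 - 1e-7)
    return -(p * np.log(p) + (1 - p) * np.log(1 - p)).sum(axis=1)


def select_informative(probs, k):
    """Return indices of the k samples with the highest summed label entropy."""
    scores = multilabel_entropy(probs)
    return np.argsort(-scores)[:k]


def filter_redundant(candidates, train_feats, min_dist):
    """Return indices of candidates whose nearest training vector is at least
    min_dist away (hypothetical redundancy check, Euclidean distance)."""
    candidates = np.asarray(candidates, dtype=float)
    train_feats = np.asarray(train_feats, dtype=float)
    # Pairwise distances, shape (n_candidates, n_train).
    dists = np.linalg.norm(candidates[:, None, :] - train_feats[None, :, :], axis=2)
    return np.where(dists.min(axis=1) >= min_dist)[0]


# Predicted per-label probabilities for three unlabeled samples, two labels each.
probs = [[0.5, 0.5], [0.99, 0.01], [0.6, 0.9]]
picked = select_informative(probs, k=2)

# Feature vectors of two generated samples versus one training sample.
kept = filter_redundant([[0.0, 0.0], [5.0, 5.0]], [[0.0, 0.1]], min_dist=1.0)
```

In this toy run the first sample is the most uncertain on both labels and ranks first, and the generated sample that sits almost on top of a training point is dropped. A real multi-label AL system would replace both stand-ins with learned, label-aware scores.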