Napravnik Mateja, Hržić Franko, Tschauner Sebastian, Štajduhar Ivan
Faculty of Engineering, University of Rijeka, Vukovarska 58, Rijeka, 51000, Croatia.
Center for Artificial Intelligence and Cybersecurity, Radmile Matejcic 2, Rijeka, 51000, Croatia.
BioData Min. 2024 Jul 12;17(1):22. doi: 10.1186/s13040-024-00373-1.
The use of machine learning in medical diagnosis and treatment has grown significantly in recent years with the development of computer-aided diagnosis systems, often based on annotated medical radiology images. However, the lack of large annotated image datasets remains a major obstacle, as the annotation process is time-consuming and costly. This study aims to overcome this challenge by proposing an automated method for annotating a large database of medical radiology images based on their semantic similarity.
An automated, unsupervised approach is used to create a large annotated dataset of medical radiology images originating from the Clinical Hospital Centre Rijeka, Croatia. The pipeline is built by data-mining three different types of medical data: images, DICOM metadata and narrative diagnoses. The optimal feature extractors are integrated into a multimodal representation, which is then clustered, yielding an automated pipeline that labels a precursor dataset of 1,337,926 medical images into 50 clusters of visually similar images. The quality of the clusters is assessed by examining their homogeneity and mutual information, taking into account the anatomical region and modality representation.
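The two cluster-quality measures named above can be illustrated with a minimal sketch. It assumes the standard information-theoretic definitions (homogeneity as 1 - H(class | cluster) / H(class), and mutual information in nats); the abstract does not state the exact formulation or normalisation the authors used. The function names and the toy anatomical-region labels are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in nats) of a label sequence."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def mutual_information(classes, clusters):
    """Mutual information (in nats) between reference classes and cluster ids."""
    n = len(classes)
    joint = Counter(zip(classes, clusters))
    class_counts = Counter(classes)
    cluster_counts = Counter(clusters)
    mi = 0.0
    for (c, k), n_ck in joint.items():
        # p(c, k) * log( p(c, k) / (p(c) * p(k)) )
        mi += (n_ck / n) * math.log(n * n_ck / (class_counts[c] * cluster_counts[k]))
    return mi


def homogeneity(classes, clusters):
    """1 - H(class | cluster) / H(class); 1.0 means each cluster holds one class."""
    h_c = entropy(classes)
    if h_c == 0.0:
        return 1.0  # only one class present, so every cluster is trivially pure
    # H(C | K) = H(C) - I(C; K), hence homogeneity = I(C; K) / H(C)
    return mutual_information(classes, clusters) / h_c


# Toy data (assumed): anatomical region per image and its assigned cluster id.
regions = ["chest", "chest", "knee", "knee", "hand", "hand"]
cluster_ids = [0, 0, 1, 1, 1, 1]

print(f"homogeneity        = {homogeneity(regions, cluster_ids):.3f}")
print(f"mutual information = {mutual_information(regions, cluster_ids):.3f} nats")
```

In this toy case cluster 0 contains only chest images, while cluster 1 mixes knee and hand images, so homogeneity falls below 1. The same functions could be applied with imaging modality as the reference label instead of anatomical region.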
The results indicate that fusing the embeddings of all three data sources provides the best results for unsupervised clustering of large-scale medical data and leads to the most concise clusters. Hence, this work marks the initial step towards building a much larger and more fine-grained annotated dataset of medical radiology images.