Doran Simon J, Barfoot Theo, Wedlake Linda, Winfield Jessica M, Petts James, Glocker Ben, Li Xingfeng, Leach Martin, Kaiser Martin, Barwick Tara D, Chaidos Aristeidis, Satchwell Laura, Soneji Neil, Elgendy Khalil, Sheeka Alexander, Wallitt Kathryn, Koh Dow-Mu, Messiou Christina, Rockall Andrea
Division of Radiotherapy and Imaging, The Institute of Cancer Research, London, UK.
National Cancer Imaging Translational Accelerator, London, UK.
Insights Imaging. 2024 Feb 16;15(1):47. doi: 10.1186/s13244-023-01591-7.
MAchine Learning In MyelomA Response (MALIMAR) is an observational clinical study combining "real-world" and clinical trial data, both retrospective and prospective. Images were acquired on three MRI scanners over a 10-year window at two institutions, leading to a need for extensive curation.
Curation involved image aggregation, pseudonymisation, allocation between project phases, data cleaning, upload to an XNAT repository visible from multiple sites, annotation, incorporation of machine learning research outputs and quality assurance using programmatic methods.
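The pseudonymisation step listed above can be illustrated with a minimal sketch. It assumes a keyed hash (HMAC-SHA256) is an acceptable way to derive stable pseudonyms; the abstract does not say which method MALIMAR used, and the function name, key, prefix and identifiers below are all hypothetical.

```python
import hashlib
import hmac


def pseudonymise(patient_id: str, secret_key: bytes, prefix: str = "SUBJ") -> str:
    """Map a real identifier to a stable pseudonym using a keyed hash.

    The same (patient_id, secret_key) pair always yields the same pseudonym,
    so repeat imaging sessions for one subject stay linked, while the mapping
    cannot be recomputed without the key.
    """
    digest = hmac.new(secret_key, patient_id.encode("utf-8"), hashlib.sha256).hexdigest()
    # Keep a short, fixed-length token: 12 hex characters (48 bits).
    return f"{prefix}_{digest[:12].upper()}"


if __name__ == "__main__":
    key = b"example-key-held-by-the-data-controller"  # hypothetical key
    first = pseudonymise("hospital-0001", key)   # hypothetical identifier
    again = pseudonymise("hospital-0001", key)
    other = pseudonymise("hospital-0002", key)
    print(first == again)  # same subject, same pseudonym
    print(first != other)  # different subjects, different pseudonyms
```

A keyed hash is only one option; a lookup table held by the data controller would serve the same purpose and makes re-identification by authorised staff more direct.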
A total of 796 whole-body MR imaging sessions from 462 subjects were curated. A major change in scan protocol part way through the retrospective window meant that approximately 30% of available imaging sessions had properties that differed significantly from the remainder of the data. Issues were found with a vendor-supplied clinical algorithm for "composing" whole-body images from multiple imaging stations. Historic weaknesses in a digital video disk (DVD) research archive (already addressed by the mid-2010s) were highlighted by incomplete datasets, some of which could not be completely recovered. The final dataset contained 736 imaging sessions for 432 subjects. Software was written to clean and harmonise data. Implications for the subsequent machine learning activity are considered.
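The attrition implied by these figures can be worked through directly. The short sketch below uses only the four counts quoted above; the function and constant names are illustrative.

```python
# Counts quoted in the abstract.
SESSIONS_CURATED = 796
SUBJECTS_CURATED = 462
SESSIONS_FINAL = 736
SUBJECTS_FINAL = 432


def attrition(before: int, after: int) -> tuple:
    """Return (number dropped, fraction retained) for a before/after count."""
    return before - after, after / before


if __name__ == "__main__":
    dropped_sessions, kept_sessions = attrition(SESSIONS_CURATED, SESSIONS_FINAL)
    dropped_subjects, kept_subjects = attrition(SUBJECTS_CURATED, SUBJECTS_FINAL)
    print(f"sessions: {dropped_sessions} dropped, {kept_sessions:.1%} retained")
    print(f"subjects: {dropped_subjects} dropped, {kept_subjects:.1%} retained")
    # Mean number of imaging sessions per subject, before and after curation.
    print(f"{SESSIONS_CURATED / SUBJECTS_CURATED:.2f} -> {SESSIONS_FINAL / SUBJECTS_FINAL:.2f}")
```

On these numbers, 60 sessions and 30 subjects were lost between curation and the final dataset, so roughly 92.5% of sessions and 93.5% of subjects were retained.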
MALIMAR exemplifies the vital role that curation plays in machine learning studies that use real-world data. A research repository such as XNAT facilitates day-to-day management, ensures robustness and consistency and enhances the value of the final dataset. The types of process described here will be vital for future large-scale multi-institutional and multi-national imaging projects.
This article showcases innovative data curation methods using a state-of-the-art image repository platform; such tools will be vital for managing the large multi-institutional datasets required to train and validate generalisable ML algorithms and future foundation models in medical imaging.
• Heterogeneous data in the MALIMAR study required the development of novel curation strategies.
• Correction of multiple problems affecting the real-world data was successful, but implications for machine learning are still being evaluated.
• Modern image repositories have rich application programming interfaces enabling data enrichment and programmatic QA, making them much more than simple "image marts".
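As an illustration of what programmatic QA against a repository might look like, the sketch below flags imaging sessions that lack an expected series type. It is not the study's software. It assumes scan metadata has already been retrieved (for example through the repository's API) as a list of dictionaries per session. The `type` key, the required series names and the session labels are hypothetical, not a documented XNAT schema.

```python
from typing import Dict, Iterable, List, Mapping, Set

# Hypothetical series types that a complete session is expected to contain.
REQUIRED_SCAN_TYPES: Set[str] = {"DWI", "DIXON"}


def missing_scan_types(
    scans: Iterable[Mapping[str, str]],
    required: Set[str] = REQUIRED_SCAN_TYPES,
) -> List[str]:
    """Return the required series types absent from one session, sorted."""
    present = {scan.get("type", "") for scan in scans}
    return sorted(required - present)


def qa_report(sessions: Mapping[str, Iterable[Mapping[str, str]]]) -> Dict[str, List[str]]:
    """Map each incomplete session label to the series types it lacks."""
    report: Dict[str, List[str]] = {}
    for label, scans in sessions.items():
        missing = missing_scan_types(scans)
        if missing:
            report[label] = missing
    return report


if __name__ == "__main__":
    # Hypothetical metadata for two sessions.
    example = {
        "SESSION_A": [{"type": "DWI"}, {"type": "DIXON"}],
        "SESSION_B": [{"type": "DWI"}],
    }
    print(qa_report(example))
```

Keeping the check as a pure function over already-fetched metadata means the same rule can be rerun whenever new sessions are uploaded, independent of how the metadata is retrieved.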