Automatic sequence identification in multicentric prostate multiparametric MRI datasets for clinical machine-learning.

Authors

de Almeida José Guilherme, Verde Ana Sofia Castro, Bilreiro Carlos, Santiago Inês, Ip Joana, Tsiknakis Manolis, Marias Kostas, Regge Daniele, Matos Celso, Papanikolaou Nickolas

Affiliations

Champalimaud Foundation, Lisbon, Portugal.

Champalimaud Clinical Center, Lisbon, Portugal.

Publication information

Insights Imaging. 2025 Mar 27;16(1):75. doi: 10.1186/s13244-025-01938-2.

Abstract

OBJECTIVES

To present an accurate machine-learning (ML) method and knowledge-based heuristics for automatic sequence-type identification in multi-centric multiparametric MRI (mpMRI) datasets for prostate cancer (PCa) ML.

METHODS

Retrospective prostate mpMRI studies were classified into 5 series types: T2-weighted (T2W), diffusion-weighted images (DWI), apparent diffusion coefficients (ADC), dynamic contrast-enhanced (DCE) and other series types (others). Metadata was processed for all series and two models were trained (XGBoost after custom categorical tokenization and CatBoost with raw categorical data) using 5-fold cross-validation (CV) with different data fractions for learning curve analyses. For validation, two test sets (a hold-out test set and a temporal split) were used. A leave-one-group-out (LOGO) CV analysis was performed with centres as groups to understand the effect of dataset-specific data.
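
As an illustration of the leave-one-group-out scheme described above, the following is a minimal sketch in plain Python, not the authors' implementation. The function name `leave_one_group_out` and the centre labels are hypothetical; each distinct centre is held out once while all remaining centres form the training set.

```python
from collections import defaultdict


def leave_one_group_out(groups):
    """Yield (held_out_group, train_indices, test_indices) once per distinct group."""
    by_group = defaultdict(list)
    for index, group in enumerate(groups):
        by_group[group].append(index)
    for held_out in sorted(by_group):
        test_idx = by_group[held_out]
        # Every sample that does not belong to the held-out group is used for training.
        train_idx = sorted(
            i for group, indices in by_group.items() if group != held_out for i in indices
        )
        yield held_out, train_idx, test_idx


# Hypothetical centre label for each series (illustrative, not data from the study).
centres = ["A", "A", "B", "C", "B", "C", "A"]
splits = list(leave_one_group_out(centres))
for centre, train_idx, test_idx in splits:
    print(centre, train_idx, test_idx)
```

Because no series from the held-out centre is seen during training, the score on that centre approximates performance on a dataset from an unseen site, which is what distinguishes this analysis from ordinary 5-fold CV.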

RESULTS

4045 studies (31,053 series) and 1004 studies (7891 series) from 11 centres were used to train and test series identification models, respectively. Test F1-scores were consistently above 0.95 (CatBoost) and 0.97 (XGBoost). Learning curves demonstrate learning saturation, while temporal validation shows that models remain capable of correctly identifying all T2W/DWI/ADC triplets. However, optimal performance requires centre-specific data: controlling for model and feature set, F1-scores dropped from CV to LOGO CV for T2W, DCE and others (-0.146, -0.181 and -0.179, respectively), with larger performance decreases for CatBoost (-0.265). Finally, we delineate heuristics to assist researchers in series classification for PCa mpMRI datasets.
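
To make the per-class F1 comparison concrete, here is a small sketch that computes F1 = 2TP / (2TP + FP + FN) for each series type and takes the plain difference between two sets of predictions. The labels and predictions are invented toy data, and the helper `f1_per_class` is hypothetical. The abstract does not state exactly how the reported differences were estimated ("controlling for model and feature set" suggests a statistical model), so the simple subtraction here is an assumption.

```python
def f1_per_class(y_true, y_pred, labels):
    """Return {label: F1} using F1 = 2*TP / (2*TP + FP + FN)."""
    scores = {}
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # A class that is neither present nor predicted gets 0.0 by convention here.
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


LABELS = ["T2W", "DWI", "ADC", "DCE", "others"]

# Toy ground truth and predictions (assumed for illustration only).
y_true = ["T2W", "DWI", "ADC", "DCE", "others", "T2W"]
y_cv = ["T2W", "DWI", "ADC", "DCE", "others", "T2W"]
y_logo = ["T2W", "DWI", "ADC", "others", "others", "others"]

cv = f1_per_class(y_true, y_cv, LABELS)
logo = f1_per_class(y_true, y_logo, LABELS)

# Negative values mean the class is identified less well on an unseen centre.
delta = {label: logo[label] - cv[label] for label in LABELS}
print(delta)
```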

CONCLUSIONS

Automatic series-type identification is feasible and can enable automated data curation. However, dataset-specific data should be included to achieve optimal performance.

CRITICAL RELEVANCE STATEMENT

Organising large collections of data is time-consuming but necessary to train clinical machine-learning models. To address this, we outline and validate an automatic series identification method that can facilitate this process. Finally, we outline a set of metadata-based heuristics that can be used to further automate series-type identification.
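
The abstract does not list the heuristics themselves. Purely as an illustration of what a metadata-based rule could look like, the sketch below matches keywords in the DICOM `SeriesDescription` string. The keyword patterns, their order, and the function `guess_series_type` are all assumptions made for this example and should not be read as the authors' rules.

```python
import re

# Hypothetical keyword rules, checked in order (illustrative only).
# ADC is tested before DWI because ADC maps are often derived from, and named
# after, the diffusion series they come from.
RULES = [
    ("ADC", re.compile(r"adc|apparent")),
    ("DWI", re.compile(r"dwi|diff|trace")),
    ("DCE", re.compile(r"dce|dyn|perf")),
    ("T2W", re.compile(r"t2")),
]


def guess_series_type(series_description):
    """Return one of T2W, DWI, ADC, DCE or others from a series description."""
    text = series_description.lower()
    for series_type, pattern in RULES:
        if pattern.search(text):
            return series_type
    return "others"


# Made-up series descriptions for demonstration.
for description in ["t2_tse_tra", "ep2d_diff_b50_800_ADC", "ep2d_diff_b50_800", "localizer"]:
    print(description, "->", guess_series_type(description))
```

Rules like these depend heavily on local naming conventions, which is consistent with the finding that centre-specific data are needed for optimal performance.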

KEY POINTS

Multi-centric prostate MRI studies were used for sequence annotation model training. Automatic sequence annotation requires few instances and generalises temporally. Sequence annotation, necessary for clinical AI model training, can be performed automatically.
