Akinci D'Antonoli Tugba, Cavallo Armando Ugo, Kocak Burak, Borgheresi Alessandra, Ponsiglione Andrea, Stanzione Arnaldo, Koltsakis Emmanouil, Doniselli Fabio Martino, Vernuccio Federica, Ugga Lorenzo, Triantafyllou Matthaios, Huisman Merel, Klontzas Michail E, Trotta Romina, Cannella Roberto, Fanni Salvatore Claudio, Cuocolo Renato
Institute of Radiology and Nuclear Medicine, Cantonal Hospital Baselland, Liestal, Switzerland.
Division of Radiology, Istituto Dermopatico dell'Immacolata (IDI), IRCCS, Rome, Italy.
Eur Radiol. 2025 Feb 19. doi: 10.1007/s00330-025-11443-1.
To investigate the intra- and inter-rater reliability of the total methodological radiomics score (METRICS) and its items through a multi-reader analysis.
A total of 12 raters with different backgrounds and experience levels were recruited for the study. Raters were stratified by level of expertise and randomly assigned to four groups: two inter-rater reliability groups and two intra-rater reliability groups, with one group of each pair receiving a preliminary training session on the use of METRICS and the other receiving none. The inter-rater reliability groups assessed all 34 papers, while the intra-rater reliability groups assessed 17 papers twice, with 21 days allowed for each assessment round and a 60-day "wash-out" period between rounds.
Inter-rater reliability was poor to moderate both in group 1 (without training; ICC = 0.393; 95% CI = 0.115-0.630; p = 0.002) and in group 2 (with training; ICC = 0.433; 95% CI = 0.127-0.671; p = 0.002). Intra-rater reliability was excellent for raters 9 and 12, good to excellent for raters 8 and 10, moderate to excellent for rater 7, and poor to good for rater 11.
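For readers who wish to reproduce this kind of agreement analysis, the sketch below shows how an inter-rater ICC with a 95% CI and p-value, as reported above, might be computed in Python with the pingouin package. The abstract does not state which ICC model the authors used, so the choice of a two-way, absolute-agreement, single-rater estimate (ICC2) is an assumption, and the table of scores is purely hypothetical.

# Minimal sketch, assuming pingouin is installed and an ICC2 model; data are illustrative only.
import pandas as pd
import pingouin as pg

# Hypothetical long-format table: one row per (paper, rater) pair with the total METRICS score.
scores = pd.DataFrame({
    "paper": ["P01", "P01", "P01", "P02", "P02", "P02", "P03", "P03", "P03"],
    "rater": ["R1", "R2", "R3"] * 3,
    "metrics_total": [62.5, 70.0, 55.0, 48.0, 52.5, 41.0, 80.0, 77.5, 68.0],
})

# Compute all ICC variants; each row of the result is one ICC type with its CI and p-value.
icc = pg.intraclass_corr(
    data=scores, targets="paper", raters="rater", ratings="metrics_total"
)

# Report the single-rater, absolute-agreement estimate (ICC2), mirroring the ICC/CI/p format above.
row = icc.set_index("Type").loc["ICC2"]
print(f"ICC = {row['ICC']:.3f}; 95% CI = {row['CI95%']}; p = {row['pval']:.3f}")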
The intra-rater reliability of the METRICS score was relatively good, while the inter-rater reliability was relatively low. This highlights the need for further efforts to reach a common understanding of the METRICS items, as well as for resources offering explanations, elaborations, and examples to improve reproducibility and enhance the tool's usability and robustness.
Question Guidelines and scoring tools are necessary to improve the quality of radiomics research; however, applying these tools is challenging for less experienced raters.
Findings Intra-rater reliability was high across all raters regardless of experience level or previous training, while inter-rater reliability was generally poor to moderate.
Clinical relevance Guidelines and scoring tools are necessary for proper reporting in radiomics research and for closing the gap between research and clinical implementation. Further resources offering explanations, elaborations, and examples are needed to enhance their usability and robustness.