Keeble Claire, Baxter Paul D, Gislason-Lee Amber J, Treadgold Laura A, Davies Andrew G
1 Division of Epidemiology and Biostatistics, University of Leeds, Leeds, UK.
2 Division of Biomedical Imaging, University of Leeds, Leeds, UK.
Br J Radiol. 2016 Jul;89(1063):20160094. doi: 10.1259/bjr.20160094. Epub 2016 Apr 12.
The assessment of image quality in medical imaging often requires observers to rate images against some metric or for a detectability task. These subjective results are used in optimization, radiation dose reduction or system comparison studies and may be compared with objective measures from a computer vision algorithm performing the same task. One popular scoring approach is to use a Likert scale and then assign consecutive numbers to the categories. The mean of these response values is then taken and compared with the objective measure or a second subjective response. Agreement is often assessed using correlation coefficients. We highlight a number of weaknesses in this common approach, including inappropriate analyses of ordinal data and the inability to properly account for correlations caused by repeated images or observers. We suggest alternative data collection and analysis techniques, such as amendments to the scale and multilevel proportional odds models. We detail the suitability of each approach depending upon the data structure and demonstrate each method using a medical imaging example. Whilst others have raised some of these issues, we evaluate the entire study from data collection to analysis, suggest sources for software and further reading, and provide a checklist plus flowchart for use with any ordinal data. We hope that raised awareness of the limitations of the current approaches will encourage greater consideration of methods and the use of more appropriate analyses. More accurate comparisons between measures in medical imaging will lead to a more robust contribution to the imaging literature and ultimately improved patient care.
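As a rough illustration of the proportional odds model named in the abstract, the sketch below computes category probabilities for a five-point ordinal rating from the standard cumulative logit form, P(Y <= k | x) = logistic(theta_k - beta * x). It covers the single-level model only; a multilevel version would add random effects for repeated images or observers, which are omitted here. The function names, thresholds, coefficient and covariate values are all hypothetical and chosen for illustration; none of them come from the study.

```python
import math


def logistic(z):
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-z))


def category_probabilities(thresholds, beta, x):
    """Category probabilities under a single-level proportional odds model.

    thresholds: increasing cutpoints theta_1 < ... < theta_{K-1}
    beta:       one coefficient shared by every cutpoint (the
                "proportional odds" assumption)
    x:          covariate value, e.g. an objective image quality measure
    """
    # Cumulative probabilities P(Y <= k | x); the last category closes at 1.
    cumulative = [logistic(t - beta * x) for t in thresholds] + [1.0]
    probabilities = []
    previous = 0.0
    for c in cumulative:
        probabilities.append(c - previous)
        previous = c
    return probabilities


# Hypothetical values: 4 cutpoints give a 5-category rating scale.
thresholds = [-2.0, -0.5, 0.5, 2.0]
beta = 1.2

for x in (0.0, 1.0):
    probs = category_probabilities(thresholds, beta, x)
    print(x, [round(p, 3) for p in probs])
```

Because beta is shared across cutpoints, a one-unit increase in x multiplies the odds of a rating at or below any category by the same factor, exp(-beta). The model uses only the ordering of the categories, so it does not rely on the numbers assigned to them being equally spaced, unlike a mean of category scores.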