Zhou Thomas J, Raza Sughra, Nelson Kerrie P
Department of Biostatistics, Boston University School of Public Health, Boston, MA, USA.
Department of Radiology, Brigham and Women's Hospital, Boston, MA, USA.
J Appl Stat. 2021;48(10):1861-1881. doi: 10.1080/02664763.2020.1777394. Epub 2020 Jun 9.
Advances in breast imaging and other screening tests have prompted studies that evaluate and compare the consistency of experts' ratings between existing and new screening tests. In clinical settings, medical experts make subjective assessments of screening test results such as mammograms. Consistency between experts' ratings is evaluated by measures of inter-rater agreement or association. However, conventional measures, such as Cohen's and Fleiss' kappas, cannot be applied or may perform poorly when studies involve many experts or unbalanced data, or when dependencies exist between experts' ratings. Here we assess the performance of existing approaches, including recently developed summary measures, for assessing the agreement between experts' binary and ordinal ratings when patients undergo two screening procedures. Methods to assess consistency between repeated measurements by the same experts are also described. We present applications to three large-scale clinical screening studies. Properties of these agreement measures are illustrated via simulation studies. Generally, a model-based approach provides several advantages over alternative methods, including the ability to flexibly incorporate various measurement scales (i.e. binary or ordinal), large numbers of experts and patients, and sparse data, as well as robustness to the prevalence of the underlying disease.
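As a rough illustration of why robustness to disease prevalence matters, the following minimal Python sketch computes the standard two-rater Cohen's kappa on two small data sets. It is not the model-based approach assessed in the paper; the function names `cohens_kappa` and `make_ratings` and all counts are hypothetical, chosen only so that both data sets have the same 90% observed agreement but different prevalence of positive ratings.

```python
from collections import Counter


def cohens_kappa(ratings_a, ratings_b):
    """Standard Cohen's kappa for two raters scoring the same items."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(ratings_a)
    # Observed agreement: proportion of items given the same category.
    p_obs = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Chance agreement from each rater's marginal category frequencies.
    freq_a = Counter(ratings_a)
    freq_b = Counter(ratings_b)
    p_exp = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if p_exp == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (p_obs - p_exp) / (1.0 - p_exp)


def make_ratings(both_pos, a_only, b_only, both_neg):
    """Expand a 2x2 table of counts into two binary rating lists (hypothetical helper)."""
    a = [1] * both_pos + [1] * a_only + [0] * b_only + [0] * both_neg
    b = [1] * both_pos + [0] * a_only + [1] * b_only + [0] * both_neg
    return a, b


# Hypothetical counts: 100 patients each, 90 agreements in both data sets.
balanced = make_ratings(both_pos=45, a_only=5, b_only=5, both_neg=45)
rare = make_ratings(both_pos=5, a_only=5, b_only=5, both_neg=85)

kappa_balanced = cohens_kappa(*balanced)  # positives are about half of ratings
kappa_rare = cohens_kappa(*rare)          # positives are about 10% of ratings

print(f"balanced prevalence: kappa = {kappa_balanced:.3f}")
print(f"rare positives:      kappa = {kappa_rare:.3f}")
```

With these made-up counts, kappa is noticeably lower when positive ratings are rare, even though the raters agree on the same share of patients in both cases. This sensitivity to the marginal distribution is one of the known limitations of conventional kappa statistics.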