Nelson Kerrie P, Mitani Aya A, Edwards Don
Department of Biostatistics, Boston University, 801 Massachusetts Avenue, Boston, MA, 02118, U.S.A.
Department of Statistics, University of South Carolina, Columbia, SC, 29208, U.S.A.
Stat Med. 2017 Sep 10;36(20):3181-3199. doi: 10.1002/sim.7323. Epub 2017 Jun 13.
Widespread inconsistencies are commonly observed between physicians' ordinal classifications of screening test results such as mammograms. These discrepancies have motivated large-scale agreement studies in which many raters contribute ratings. The primary goal of these studies is to identify factors related to physicians and to patients' test results that may lead to stronger consistency between raters' classifications. While ordered categorical scales are frequently used to classify screening test results, very few statistical approaches exist to model agreement between multiple raters. Here we develop a flexible and comprehensive approach to assess the influence of rater and subject characteristics on agreement between multiple raters' ordinal classifications in large-scale agreement studies. Our approach is based upon the class of generalized linear mixed models. Novel model-based summary measures are proposed to assess agreement between all raters or a subgroup of raters, such as experienced physicians. Hypothesis tests are described to formally identify factors, such as physicians' level of experience, that play an important role in improving the consistency of ratings between raters. We demonstrate how unique characteristics of individual raters can be assessed via conditional modes generated during the modeling process. Simulation studies are presented to demonstrate the performance of the proposed methods and summary measure of agreement. The methods are applied to a large-scale mammography agreement study to investigate the effects of rater and patient characteristics on the strength of agreement between radiologists. Copyright © 2017 John Wiley & Sons, Ltd.
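To make the setting in the abstract concrete, the following is a minimal simulation sketch and not the authors' method. It assumes an ordinal probit mixed model with crossed random effects: a subject effect, a rater effect, and standard normal noise on a latent scale that is cut into ordered categories. The abstract states only that the approach belongs to the class of generalized linear mixed models, so the link function, the cutpoints, the variance values, and the function names `simulate_ratings` and `pairwise_agreement` are all illustrative assumptions. The agreement statistic here is the raw proportion of matching ratings over all rater pairs, which is not chance-corrected and is not the model-based summary measure proposed in the paper.

```python
import numpy as np


def simulate_ratings(n_subjects, n_raters, sigma_u, sigma_v, cutpoints, rng):
    """Simulate ordinal ratings from an assumed latent-variable model.

    latent[i, j] = u[i] + v[j] + e[i, j], with subject effect u, rater
    effect v and standard normal noise e (all assumptions for this sketch).
    """
    u = rng.normal(0.0, sigma_u, size=(n_subjects, 1))   # subject effects
    v = rng.normal(0.0, sigma_v, size=(1, n_raters))     # rater effects
    e = rng.normal(0.0, 1.0, size=(n_subjects, n_raters))
    latent = u + v + e
    # Category = number of cutpoints at or below the latent value (0..K-1).
    return np.digitize(latent, cutpoints)


def pairwise_agreement(ratings):
    """Raw proportion of identical ratings, pooled over all rater pairs."""
    n_subjects, n_raters = ratings.shape
    matches = 0
    comparisons = 0
    for j in range(n_raters):
        for k in range(j + 1, n_raters):
            matches += int(np.sum(ratings[:, j] == ratings[:, k]))
            comparisons += n_subjects
    return matches / comparisons


if __name__ == "__main__":
    rng = np.random.default_rng(2017)
    cutpoints = np.array([-1.0, 0.0, 1.0])  # hypothetical 4-category scale

    # Same subject heterogeneity, different amounts of rater heterogeneity.
    similar_raters = simulate_ratings(500, 20, 2.0, 0.1, cutpoints, rng)
    varied_raters = simulate_ratings(500, 20, 2.0, 1.5, cutpoints, rng)

    print("raw agreement, similar raters:", pairwise_agreement(similar_raters))
    print("raw agreement, varied raters: ", pairwise_agreement(varied_raters))
```

In this sketch, increasing the rater variance relative to the subject variance lowers the raw agreement, which is the kind of relationship a model-based agreement measure would be expected to capture. Fitting such a model to real data would require dedicated mixed-model software, which is outside the scope of this illustration.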