A comparison of Cohen's Kappa and Gwet's AC1 when calculating inter-rater reliability coefficients: a study conducted with personality disorder samples.

Affiliation

Department of Psychiatry, Faculty of Medicine, Chiang Mai University, Chiang Mai 50200, Thailand.

Publication information

BMC Med Res Methodol. 2013 Apr 29;13:61. doi: 10.1186/1471-2288-13-61.

DOI:10.1186/1471-2288-13-61

PMID:23627889

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC3643869/

Abstract

BACKGROUND

Rater agreement is important in clinical research, and Cohen's Kappa is a widely used method for assessing inter-rater reliability; however, there are well-documented statistical problems associated with the measure. To assess its utility, we evaluated it against Gwet's AC1 and compared the results.

METHODS

The study included 67 patients (56% male) aged 18 to 67 years, with a mean age of 44.13 years (SD 12.68). Nine raters (7 psychiatrists, a psychiatry resident and a social worker) participated as interviewers, for either the first or the second interview; the two interviews were held 4 to 6 weeks apart. The purpose of the interviews was to establish a personality disorder (PD) diagnosis using DSM-IV criteria. Cohen's Kappa and Gwet's AC1 were both calculated, and the level of agreement between raters was assessed in terms of a simple categorical diagnosis (i.e., the presence or absence of a disorder). Data were also compared with a previous analysis in order to evaluate the effects of trait prevalence.

RESULTS

Gwet's AC1 was shown to have higher inter-rater reliability coefficients for all the PD criteria, ranging from 0.752 to 1.000, whereas Cohen's Kappa ranged from 0 to 1.00. Cohen's Kappa values were high and close to the percentage of agreement when the prevalence was high, whereas Gwet's AC1 values appeared not to change much with a change in prevalence, but remained close to the percentage of agreement. For example, a Schizoid sample revealed a mean Cohen's Kappa of 0.726 and a Gwet's AC1 of 0.853, values that fall within different levels of agreement according to the criteria developed by Landis and Koch, and Altman and Fleiss.
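As an illustration of how the two Schizoid coefficients land in different agreement levels, the following minimal sketch maps a coefficient to the commonly cited Landis and Koch benchmark bands. The function name `landis_koch_label` is hypothetical and is not taken from the study; only the Landis and Koch scale is shown, not the Altman or Fleiss scales.

```python
def landis_koch_label(coefficient: float) -> str:
    """Return the Landis and Koch (1977) label for an agreement coefficient.

    Illustrative helper only; the function name and structure are
    assumptions made for this sketch, not part of the study.
    """
    if coefficient > 1.0:
        raise ValueError("agreement coefficients cannot exceed 1.0")
    if coefficient < 0.0:
        return "poor"
    # Upper bound of each band, paired with its label.
    bands = [
        (0.20, "slight"),
        (0.40, "fair"),
        (0.60, "moderate"),
        (0.80, "substantial"),
        (1.00, "almost perfect"),
    ]
    for upper_bound, label in bands:
        if coefficient <= upper_bound:
            return label
    return "almost perfect"  # not reached for valid inputs; kept for safety


# Values reported for the Schizoid sample in the abstract.
for name, value in [("Cohen's Kappa", 0.726), ("Gwet's AC1", 0.853)]:
    print(f"{name} = {value:.3f}: {landis_koch_label(value)}")
```

On this scale the reported Kappa of 0.726 is labelled "substantial" while the AC1 of 0.853 is labelled "almost perfect", so the choice of coefficient alone can move the same ratings into a different band.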

CONCLUSIONS

Based on the different formulae used to calculate the level of chance-corrected agreement, Gwet's AC1 was shown to provide a more stable inter-rater reliability coefficient than Cohen's Kappa. It was also found to be less affected by prevalence and marginal probability than Cohen's Kappa, and should therefore be considered for use in inter-rater reliability analysis.
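The difference between the two chance-agreement formulae can be sketched for the simplest case of two raters and a present/absent diagnosis. The code below is a minimal illustration using the standard two-category definitions of Cohen's Kappa and Gwet's AC1. The function names and the 2x2 counts are hypothetical and are not data from the study.

```python
import math


def cohen_kappa(a: int, b: int, c: int, d: int) -> float:
    """Cohen's Kappa for a 2x2 table.

    a: both raters say present    b: rater 1 present, rater 2 absent
    c: rater 1 absent, rater 2 present    d: both raters say absent
    """
    n = a + b + c + d
    p_observed = (a + d) / n
    r1_present = (a + b) / n  # rater 1 marginal proportion "present"
    r2_present = (a + c) / n  # rater 2 marginal proportion "present"
    # Chance agreement: product of the raters' marginals, per category.
    p_chance = r1_present * r2_present + (1 - r1_present) * (1 - r2_present)
    if math.isclose(p_chance, 1.0):
        return float("nan")  # undefined when both raters use one category only
    return (p_observed - p_chance) / (1 - p_chance)


def gwet_ac1(a: int, b: int, c: int, d: int) -> float:
    """Gwet's AC1 for a 2x2 table (same cell layout as cohen_kappa)."""
    n = a + b + c + d
    p_observed = (a + d) / n
    # Mean of the two raters' marginal proportions for "present".
    pi_present = ((a + b) / n + (a + c) / n) / 2
    # Chance agreement for two categories: 2 * pi * (1 - pi), at most 0.5.
    p_chance = 2 * pi_present * (1 - pi_present)
    return (p_observed - p_chance) / (1 - p_chance)


# Two hypothetical tables with the same observed agreement (0.96).
tables = {
    "balanced prevalence": (48, 2, 2, 48),
    "rare trait": (1, 2, 2, 95),
}
for label, cells in tables.items():
    print(f"{label}: kappa={cohen_kappa(*cells):.3f}, AC1={gwet_ac1(*cells):.3f}")
```

With these made-up counts both coefficients equal 0.92 when the trait is present in half the sample, but for the rare trait Kappa drops to about 0.31 while AC1 stays near 0.96, even though the raters agree on 96% of cases in both tables. This is the prevalence sensitivity the conclusions refer to.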
