School of Biological Science and Medical Engineering and Research Institute, Beihang University, Shenzhen, China.
Key Laboratory of Biomechanics and Mechanobiology of Ministry of Education, Beihang University, Beijing, 100191, China.
Sleep Breath. 2019 Jun;23(2):719-728. doi: 10.1007/s11325-019-01801-x. Epub 2019 Feb 19.
To determine inter-lab reliability in sleep stage scoring using the 2014 American Academy of Sleep Medicine (AASM) manual, to understand in depth the reasons for disagreement, and to provide suggestions for improvement.
This study included 40 all-night polysomnograms (PSGs) from different samples. The PSGs were segmented into 37,642 30-s epochs. Five doctors from China and two doctors from America scored the epochs following the 2014 AASM standard. Scoring disagreement between the two centers was evaluated using Cohen's kappa (κ). After visual inspection of the PSG epochs with deviating scores, potential reasons for disagreement were analyzed.
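For readers unfamiliar with the statistic, the following is a minimal sketch of how Cohen's kappa could be computed for two sets of epoch-by-epoch stage labels. It is not the authors' code: the function name `cohen_kappa` and the ten toy epochs are illustrative assumptions, and the abstract does not describe how the scorings within each center were combined into one label per epoch.

```python
from collections import Counter


def cohen_kappa(a, b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("label sequences must be non-empty and equal in length")
    n = len(a)
    # Observed agreement: fraction of epochs with identical labels.
    p_o = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement: product of the two marginal frequencies, summed over stages.
    count_a, count_b = Counter(a), Counter(b)
    p_e = sum(count_a[stage] * count_b[stage] for stage in count_a) / (n * n)
    if p_e == 1.0:
        # Both scorers used a single identical label throughout.
        return 1.0
    return (p_o - p_e) / (1.0 - p_e)


# Hypothetical toy data: one stage label per 30-s epoch from each center.
lab_a = ["W", "W", "N1", "N2", "N2", "N3", "N3", "R", "R", "N2"]
lab_b = ["W", "N1", "N1", "N2", "N2", "N2", "N3", "R", "R", "N2"]

kappa = cohen_kappa(lab_a, lab_b)
print(f"kappa = {kappa:.3f}")
```

Kappa corrects the raw agreement rate for the agreement expected by chance, which matters here because stage N2 typically dominates a night's recording and would inflate a plain percentage of matching epochs.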
Inter-lab reliability was substantial (κ = 0.75 ± 0.01). Scoring for stages W (κ = 0.89) and R (κ = 0.87) achieved the highest agreement, while stage N1 (κ = 0.45) showed the lowest. In terms of the relative disagreement ratio, N2-N3 (22.09%), W-N1 (19.68%), and N1-N2 (18.75%) were the most frequent discrepancy combinations. American and Chinese doctors each showed characteristic scoring tendencies in the discrepancy combinations W-N1, N1-N2, and N2-N3. Seven reasons for disagreement were identified, namely "on-threshold characteristic" (29.21%), "context influence" (18.06%), "characteristic identification difficulty" (8.81%), "arousal-wake confusion" (7.57%), "derivation inconsistence" (2.15%), "on-borderline characteristic" (0.92%), and "misrecognition" (33.27%).
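Stage-wise kappa values and relative disagreement ratios of this kind can be derived from paired epoch labels roughly as sketched below. This is an illustration under stated assumptions, not the authors' procedure: the per-stage kappa is assumed to be a one-vs-rest kappa (the stage against all other stages), the relative disagreement ratio is assumed to be each unordered stage pair's share of all disagreeing epochs, and the names `stage_kappa` and `disagreement_ratios` and the toy labels are hypothetical.

```python
from collections import Counter


def stage_kappa(a, b, stage):
    """One-vs-rest Cohen's kappa for a single stage (assumed definition)."""
    n = len(a)
    a_bin = [x == stage for x in a]
    b_bin = [y == stage for y in b]
    # Observed agreement on "is this epoch the given stage or not".
    p_o = sum(x == y for x, y in zip(a_bin, b_bin)) / n
    p_a = sum(a_bin) / n
    p_b = sum(b_bin) / n
    # Chance agreement for the binary decision.
    p_e = p_a * p_b + (1 - p_a) * (1 - p_b)
    return 1.0 if p_e == 1.0 else (p_o - p_e) / (1.0 - p_e)


def disagreement_ratios(a, b):
    """Share of all disagreeing epochs that falls on each unordered stage pair."""
    pairs = Counter(frozenset((x, y)) for x, y in zip(a, b) if x != y)
    total = sum(pairs.values())
    return {tuple(sorted(pair)): count / total for pair, count in pairs.items()}


# Hypothetical toy data: one stage label per 30-s epoch from each center.
center_1 = ["W", "W", "N1", "N2", "N2", "N3", "N3", "R", "R", "N2"]
center_2 = ["W", "N1", "N1", "N2", "N2", "N2", "N3", "R", "R", "N2"]

kappa_n1 = stage_kappa(center_1, center_2, "N1")
kappa_r = stage_kappa(center_1, center_2, "R")
ratios = disagreement_ratios(center_1, center_2)
print(kappa_n1, kappa_r, ratios)
```

Treating pairs as unordered matches the way the abstract reports combinations such as W-N1; keeping the direction (which center scored which stage) would be needed to examine center-specific tendencies.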
This study demonstrated the sleep stage scoring agreement of the 2014 AASM manual and explored potential sources of labeling ambiguity. Improvement measures were suggested accordingly to help remove ambiguity for scorers and improve scoring reliability at the international level.