Automated analysis of the AASM Inter-Scorer Reliability gold standard polysomnogram dataset.

Authors

Tripathi Ayush, Nasiri Samaneh, Ganglberger Wolfgang, Nassi Thijs, Meulenbrugge Erik-Jan, Sun Haoqi, Stone Katie L, Mignot Emmanuel, Hwang Dennis, Trotti Lynn Marie, Reyna Matthew A, Clifford Gari D, Katwa Umakanth, Thomas Robert J, Westover M Brandon

Affiliations

Department of Neurology, Beth Israel Deaconess Medical Center, Boston, MA.

Harvard Medical School, Boston, MA.

Publication

J Clin Sleep Med. 2025 Aug 12. doi: 10.5664/jcsm.11848.

Abstract

STUDY OBJECTIVES

To compare the performance of a comprehensive automated polysomnogram (PSG) analysis algorithm, CAISR (Complete Artificial Intelligence Sleep Report), to a multi-expert gold standard panel, crowdsourced scorers, and experienced technicians for sleep staging and for detecting arousals, respiratory events, and limb movements.

METHODS

A benchmark dataset of 57 PSG records (Inter-Scorer Reliability dataset) with 200 30-second epochs scored per AASM guidelines was used. Annotations were obtained from (1) the AASM multi-expert gold standard panel, (2) AASM Inter-Scorer Reliability (ISR) platform users ("crowd," averaging 6,818 raters per epoch), (3) three experienced technicians, and (4) CAISR. Agreement was assessed via Cohen's Kappa (κ) and percent agreement.
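
The two agreement measures named above can be illustrated with a minimal Python sketch. It is not the study's code: the function names and the ten-epoch label sequences are hypothetical, chosen only to show how percent agreement and Cohen's kappa are computed for two scorers who label the same epochs.

```python
from collections import Counter


def percent_agreement(rater_a, rater_b):
    """Percentage of epochs on which two raters assign the same label."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("label sequences must be non-empty and equal in length")
    matches = sum(a == b for a, b in zip(rater_a, rater_b))
    return 100.0 * matches / len(rater_a)


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("label sequences must be non-empty and equal in length")
    n = len(rater_a)
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: product of the two raters' marginal label frequencies.
    labels = counts_a.keys() | counts_b.keys()
    p_expected = sum(counts_a[k] * counts_b[k] for k in labels) / (n * n)
    if p_expected == 1.0:
        return 1.0  # both raters used one identical label for every epoch
    return (p_observed - p_expected) / (1.0 - p_expected)


# Hypothetical sleep-stage labels for ten epochs (not data from the study).
reference = ["W", "N1", "N2", "N2", "N3", "R", "R", "N2", "W", "N2"]
algorithm = ["W", "N2", "N2", "N2", "N3", "R", "N2", "N2", "W", "N2"]

print(f"agreement = {percent_agreement(reference, algorithm):.1f}%")
print(f"kappa     = {cohens_kappa(reference, algorithm):.2f}")
```

Kappa is lower than raw agreement whenever some matches would be expected by chance alone, which is why the abstract reports both figures.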

RESULTS

Across tasks, CAISR achieved performance comparable to experienced technicians but did not match the consensus-level agreement between the multi-expert gold standard and the crowd. For sleep staging, CAISR's agreement with the multi-expert gold standard was 82.1% (κ = 0.70), comparable to experienced technicians but below the crowd (κ = 0.88). Arousal detection showed 87.81% agreement (κ = 0.45), respiratory event detection 83.18% (κ = 0.34), and limb movement detection 94.89% (κ = 0.11), each comparable to experienced technicians but trailing crowd agreement (κ = 0.83, 0.78, and 0.86 for detection of arousals, respiratory events, and limb movements, respectively).
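
The limb movement result pairs very high percent agreement with a very low kappa. One common cause of that pattern is class imbalance: when events are rare, two scorers agree on most epochs simply by marking them event-free. The sketch below shows the arithmetic with a made-up 2x2 table. The counts are hypothetical, chosen to land near the reported figures, and are not the study's confusion matrix; the abstract does not state the cause of the low kappa.

```python
def agreement_from_2x2(both_pos, a_only, b_only, both_neg):
    """Return (observed agreement, Cohen's kappa) for a binary 2x2 table.

    both_pos: epochs both raters marked as containing an event
    a_only:   epochs only rater A marked
    b_only:   epochs only rater B marked
    both_neg: epochs neither rater marked
    """
    n = both_pos + a_only + b_only + both_neg
    if n == 0:
        raise ValueError("table must contain at least one epoch")
    p_observed = (both_pos + both_neg) / n
    a_pos_rate = (both_pos + a_only) / n
    b_pos_rate = (both_pos + b_only) / n
    # Chance agreement on positives plus chance agreement on negatives.
    p_expected = a_pos_rate * b_pos_rate + (1 - a_pos_rate) * (1 - b_pos_rate)
    if p_expected == 1.0:
        return p_observed, 1.0
    return p_observed, (p_observed - p_expected) / (1 - p_expected)


# Hypothetical counts for 1,000 epochs in which events are rare (about 3%).
p_obs, kappa = agreement_from_2x2(both_pos=4, a_only=26, b_only=25, both_neg=945)
print(f"agreement = {100 * p_obs:.1f}%, kappa = {kappa:.2f}")
```

With these counts the two raters agree on about 94.9% of epochs, yet chance alone would produce about 94.3% agreement, so kappa comes out near 0.11.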

CONCLUSIONS

CAISR achieves experienced technician-level accuracy for PSG scoring tasks but does not surpass the consensus-level agreement of a multi-expert gold standard or the crowd. These findings highlight the potential of automated scoring to match experienced technician-level performance while emphasizing the value of multi-rater consensus.
