Nasiri Samaneh, Ganglberger Wolfgang, Nassi Thijs, Meulenbrugge Erik-Jan, Moura Junior Valdery, Ghanta Manohar, Gupta Aditya, Stone Katie L, Kjaer Magnus Ruud, Sum-Ping Oliver, Mignot Emmanuel, Hwang Dennis, Trotti Lynn Marie, Clifford Gari D, Katwa Umakanth, Sun Haoqi, Thomas Robert J, Westover M Brandon
Department of Neurology, Beth Israel Deaconess Medical Center, Boston, MA, USA.
Department of Neurology, Massachusetts General Hospital, Boston, MA, USA.
Sleep. 2025 Jun 24. doi: 10.1093/sleep/zsaf134.
To develop and validate the Complete Artificial Intelligence Sleep Report (CAISR), a system for comprehensive automated sleep analysis that covers sleep staging, arousal detection, apnea identification, and limb movement analysis.
We developed CAISR on a large, diverse dataset of 25,749 participants from four cohorts (MGH, MESA, MrOS, SHHS). Following American Academy of Sleep Medicine (AASM) guidelines, CAISR performs four tasks: it stages sleep into five categories (Wake, NREM 1, NREM 2, NREM 3, REM); detects arousals; detects and classifies breathing events (Obstructive Apnea, Central Apnea, Mixed Apnea, Hypopnea, and respiratory effort-related arousal [RERA]); and detects limb movements and categorizes them as periodic or isolated. Sleep staging and arousal detection use customized deep neural networks, while breathing event detection and classification and limb movement analysis use rule-based signal processing. We tested CAISR on multiple datasets, each independently annotated by multiple experts: UPenn (69 subjects, six experts), BITS (98 subjects, three experts), and Stanford (100 subjects, three experts). We quantified CAISR performance with three metrics: Cohen's Kappa, area under the receiver operating characteristic curve (AUROC), and area under the precision-recall curve (AUPRC). To determine whether CAISR performed on par with human experts, we compared expert inter-rater reliability (IRR) with algorithm-expert IRR.
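The comparison of expert IRR with algorithm-expert IRR can be illustrated with a minimal sketch. It assumes that IRR is summarized as the mean pairwise Cohen's Kappa over epoch-level labels; the abstract does not state how the authors aggregated agreement, so the helper names (`cohen_kappa`, `expert_irr`, `algorithm_irr`) and the toy labels are hypothetical and serve illustration only.

```python
from collections import Counter
from itertools import combinations
from statistics import mean


def cohen_kappa(a, b):
    """Cohen's Kappa between two equal-length label sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and of equal length")
    n = len(a)
    # Observed agreement: fraction of epochs with identical labels.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    count_a, count_b = Counter(a), Counter(b)
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one and the same label throughout
    return (observed - expected) / (1 - expected)


def expert_irr(experts):
    """Assumed summary: mean Kappa over all pairs of experts."""
    return mean(cohen_kappa(x, y) for x, y in combinations(experts, 2))


def algorithm_irr(algo, experts):
    """Assumed summary: mean Kappa between the algorithm and each expert."""
    return mean(cohen_kappa(algo, e) for e in experts)


# Hypothetical sleep-stage labels for eight epochs (not data from the study).
expert_a = ["W", "N1", "N2", "N2", "N3", "N3", "R", "R"]
expert_b = ["W", "N2", "N2", "N2", "N3", "N2", "R", "R"]
expert_c = ["W", "N1", "N2", "N2", "N3", "N3", "R", "W"]
caisr = ["W", "N1", "N2", "N2", "N3", "N3", "R", "R"]

print("expert-expert IRR:", expert_irr([expert_a, expert_b, expert_c]))
print("algorithm-expert IRR:", algorithm_irr(caisr, [expert_a, expert_b, expert_c]))
```

Under this assumed summary, an algorithm-expert value at or above the expert-expert value would be read as performance on par with human scorers.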
CAISR performed strongly on all four tasks: sleep staging, arousal detection, apnea detection, and limb movement detection. In sleep staging, AUROC ranged from 0.82 to 0.97 and AUPRC from 0.63 to 0.90 across the BITS, Stanford, and UPenn datasets, indicating strong classification performance. In the BITS and Stanford datasets, CAISR outperformed human experts: algorithm-expert Kappa was approximately 0.7 to 0.8, and its confidence interval did not overlap that of expert-expert Kappa. In the UPenn dataset, performance was comparable to that of experts, with overlapping confidence intervals suggesting non-inferiority. For arousal detection, AUROC ranged from 0.83 to 0.94 and AUPRC from 0.67 to 0.85; confidence intervals overlapped in both the BITS and Stanford datasets, indicating performance comparable to that of experts (algorithm-expert Kappa approximately 0.6 to 0.75). In apnea detection, which covers obstructive, central, and mixed apnea, CAISR achieved competitive results in the BITS dataset, with AUROC between 0.81 and 0.95 and AUPRC between 0.58 and 0.82. In the Stanford dataset, however, it underperformed human experts, as shown by non-overlapping confidence intervals and lower Kappa values (approximately 0.55 to 0.65). Finally, in limb movement detection, CAISR outperformed experts in the BITS dataset, with AUROC of 0.90 to 0.96, AUPRC of 0.75 to 0.85, and significantly higher Kappa (approximately 0.8, with non-overlapping confidence intervals). In the Stanford dataset, its performance was comparable to that of experts, with overlapping confidence intervals suggesting non-inferiority (Kappa approximately 0.65 to 0.7).
Overall, CAISR showed consistently high classification performance and reliability across tasks, often matching or surpassing expert-level performance, with particularly strong results in sleep staging and limb movement detection.
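The results above rest on whether confidence intervals for algorithm-expert and expert-expert Kappa overlap. The sketch below shows one common way to obtain such intervals, a percentile bootstrap over per-subject Kappa values. This is an assumption: the abstract does not say how the intervals were computed, and the function names and per-subject values are hypothetical.

```python
import random
from statistics import mean


def bootstrap_ci(values, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    n = len(values)
    # Resample subjects with replacement and record the mean of each resample.
    means = sorted(mean(rng.choices(values, k=n)) for _ in range(n_boot))
    lower = means[int((alpha / 2) * n_boot)]
    upper = means[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


def intervals_overlap(ci_a, ci_b):
    """True if two closed intervals (low, high) share at least one point."""
    return ci_a[0] <= ci_b[1] and ci_b[0] <= ci_a[1]


# Hypothetical per-subject Kappa values (not data from the study).
algo_vs_expert = [0.78, 0.74, 0.81, 0.69, 0.77, 0.80, 0.72, 0.76]
expert_vs_expert = [0.70, 0.66, 0.73, 0.61, 0.68, 0.71, 0.64, 0.69]

ci_algo = bootstrap_ci(algo_vs_expert)
ci_expert = bootstrap_ci(expert_vs_expert)
print("algorithm-expert CI:", ci_algo)
print("expert-expert CI:", ci_expert)
print("overlap:", intervals_overlap(ci_algo, ci_expert))
```

Non-overlapping intervals indicate a difference, whereas overlapping intervals are only consistent with comparable performance; overlap alone does not establish equivalence.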
CAISR demonstrated high classification accuracy and reliability across sleep staging, arousal, apnea, and limb movement detection tasks, generally matching or surpassing human expert performance. Human errors and systematic biases in the annotation of micro-events during sleep, such as arousals and apneas, likely contributed to variability in expert performance, whereas CAISR produced more consistent results, reducing the impact of these biases and increasing overall reliability across tasks.