CAISR: Achieving Human-Level Performance in Automated Sleep Analysis Across All Clinical Sleep Metrics.

Nasiri Samaneh, Ganglberger Wolfgang, Nassi Thijs, Meulenbrugge Erik-Jan, Moura Junior Valdery, Ghanta Manohar, Gupta Aditya, Stone Katie L, Kjaer Magnus Ruud, Sum-Ping Oliver, Mignot Emmanuel, Hwang Dennis, Trotti Lynn Marie, Clifford Gari D, Katwa Umakanth, Sun Haoqi, Thomas Robert J, Westover M Brandon

Department of Neurology, Beth Israel Deaconess Medical Center, Boston, MA, USA.

Department of Neurology, Massachusetts General Hospital, Boston, MA, USA.

Sleep. 2025 Jun 24. doi: 10.1093/sleep/zsaf134.

STUDY OBJECTIVES

To develop and validate a Complete Artificial Intelligence Sleep Report system (CAISR), a system for comprehensive automated sleep analysis, including sleep staging, arousal detection, apnea identification, and limb movement analysis.

METHODS

We used a large, diverse dataset from four cohorts (MGH, MESA, MrOS, SHHS) comprising 25,749 participants to develop CAISR. Following American Academy of Sleep Medicine (AASM) guidelines, CAISR performs four tasks: it stages sleep into five categories (Wake, NREM 1, NREM 2, NREM 3, REM), detects arousals, detects and classifies breathing events (Obstructive Apnea, Central Apnea, Mixed Apnea, Hypopnea, and RERA), and detects limb movements and categorizes them as periodic or isolated. We tested CAISR on multiple datasets, each independently annotated by multiple experts: UPenn (69 subjects, six experts), BITS (98 subjects, three experts), and Stanford (100 subjects, three experts). Sleep staging and arousal detection were accomplished using customized deep neural networks, while breathing event detection and classification and limb movement analysis were accomplished using rule-based signal processing approaches. We quantified CAISR performance with three metrics: Cohen's Kappa, area under the receiver operating characteristic curve (AUROC), and area under the precision-recall curve (AUPRC). To determine whether CAISR performed on par with human experts, we compared expert inter-rater reliability (IRR) with algorithm-expert IRR.
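The comparison of expert IRR with algorithm-expert IRR can be illustrated with a minimal sketch. The abstract does not say how agreement was aggregated across raters, so the averaging used here (mean pairwise Kappa among experts, and mean Kappa between the algorithm and each expert) is an assumption, as are the function names and the toy labels.

```python
from collections import Counter
from itertools import combinations


def cohen_kappa(a, b):
    """Cohen's Kappa for two equal-length sequences of categorical labels."""
    if len(a) != len(b) or not a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(a)
    # Observed agreement: fraction of items on which the two raters agree.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement: product of the raters' marginal label frequencies.
    count_a, count_b = Counter(a), Counter(b)
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)


def mean_pairwise_kappa(raters):
    """Expert IRR (assumed aggregation): mean Kappa over all expert pairs."""
    pairs = list(combinations(raters, 2))
    return sum(cohen_kappa(a, b) for a, b in pairs) / len(pairs)


def algorithm_expert_kappa(algorithm, experts):
    """Algorithm-expert IRR (assumed aggregation): mean Kappa vs. each expert."""
    return sum(cohen_kappa(algorithm, e) for e in experts) / len(experts)


# Hypothetical per-epoch sleep stage labels, for illustration only.
experts = [
    ["W", "N1", "N2", "N2", "N3", "R", "R", "W"],
    ["W", "N2", "N2", "N2", "N3", "R", "R", "W"],
    ["W", "N1", "N2", "N3", "N3", "R", "N1", "W"],
]
algorithm = ["W", "N1", "N2", "N2", "N3", "R", "R", "W"]

expert_irr = mean_pairwise_kappa(experts)
algo_irr = algorithm_expert_kappa(algorithm, experts)
print(f"expert-expert Kappa: {expert_irr:.2f}")
print(f"algorithm-expert Kappa: {algo_irr:.2f}")
```

Under this reading, an algorithm performs on par with experts when its agreement with the experts is at least as high as the experts' agreement with one another.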

RESULTS

The CAISR model showed strong overall performance across the four tasks: sleep staging, arousal detection, apnea detection, and limb movement detection. In sleep staging, the model achieved AUROC values ranging from 0.82 to 0.97 and AUPRC values between 0.63 and 0.90 across the BITS, Stanford, and UPenn datasets, indicating high classification accuracy. The Kappa agreement analysis showed that in the BITS and Stanford datasets, CAISR outperformed human experts, with non-overlapping confidence intervals indicating superiority (Kappa values around 0.7 to 0.8 for CAISR vs. experts). In the UPenn dataset, the model's performance was comparable to experts, with overlapping confidence intervals suggesting non-inferiority. For arousal detection, the model maintained reliable performance, with AUROC values ranging from 0.83 to 0.94 and AUPRC values from 0.67 to 0.85, and Kappa analysis showing overlapping confidence intervals, indicating comparable performance to experts in both the BITS and Stanford datasets (Kappa values for CAISR around 0.6 to 0.75). In apnea detection, including the detection of obstructive, central, and mixed apnea, the CAISR model achieved competitive results in the BITS dataset, with AUROC values between 0.81 and 0.95 and AUPRC values between 0.58 and 0.82, but in the Stanford dataset it underperformed compared to human experts, as shown by non-overlapping confidence intervals and lower Kappa values (around 0.55 to 0.65). Finally, in limb movement detection, the model demonstrated superior performance in the BITS dataset, with AUROC values of 0.9 to 0.96 and AUPRC values between 0.75 and 0.85, and Kappa analysis indicating significantly higher reliability compared to experts (CAISR Kappa around 0.8, with non-overlapping confidence intervals). In the Stanford dataset, CAISR's performance was comparable to experts, with overlapping confidence intervals suggesting non-inferiority (Kappa values around 0.65 to 0.7).
Overall, the CAISR model consistently exhibited high classification performance and reliability across tasks, often matching or surpassing expert-level performance, with particularly strong results in sleep staging and limb movement detection.
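The confidence-interval comparisons reported above can be sketched as follows. The abstract does not state how the intervals were obtained, so the percentile bootstrap over epochs used here is an assumption, as are the function names, parameters, and toy labels.

```python
import random
from collections import Counter


def cohen_kappa(a, b):
    """Cohen's Kappa for two equal-length sequences of categorical labels."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)


def bootstrap_kappa_ci(a, b, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for Kappa, resampling items with replacement."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(a)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(cohen_kappa([a[i] for i in idx], [b[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lower, upper


def intervals_overlap(ci_1, ci_2):
    """True if two closed intervals (low, high) share at least one point."""
    return ci_1[0] <= ci_2[1] and ci_2[0] <= ci_1[1]


# Hypothetical binary event labels (1 = event present), for illustration only.
expert_1 = [0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0]
expert_2 = [0, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0]
algorithm = [0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0]

ci_expert = bootstrap_kappa_ci(expert_1, expert_2)
ci_algo = bootstrap_kappa_ci(algorithm, expert_2)
print("expert-expert CI:", ci_expert)
print("algorithm-expert CI:", ci_algo)
print("intervals overlap:", intervals_overlap(ci_expert, ci_algo))
```

Checking for interval overlap is a coarse screen: non-overlapping intervals indicate a clear difference, while overlapping intervals are only consistent with comparable performance and do not by themselves establish it.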

CONCLUSIONS

The CAISR model demonstrated high classification accuracy and reliability across sleep staging, arousal, apnea, and limb movement detection tasks, matching or surpassing human expert performance. Human errors and systematic biases in the annotation of micro-events during sleep, such as arousals and apneas, likely contributed to variability in expert performance, while the CAISR model showed more consistent results, reducing the impact of these biases and increasing overall reliability across tasks.
