Moeller Annemette L, Perslev Mathias, Paulsrud Cecilie, Thorsen Steffen U, Leonthin Helle, Debes Nanette M, Svensson Jannet, Jennum Poul
Department of Clinical Research, Steno Diabetes Center Copenhagen, Herlev, Denmark.
Department of Clinical Neurophysiology, Danish Center for Sleep Medicine, Glostrup, Denmark.
Sleep. 2025 Jul 11;48(7). doi: 10.1093/sleep/zsaf053.
The manual annotation of polysomnography (PSG) hypnograms is difficult and time-consuming. U-Sleep is a fast, publicly available, automated sleep staging system that has been evaluated in adult PSGs. In this study, we compare staging by sleep experts with staging by U-Sleep in a pediatric sample.
PSGs from 56 children aged 6-17 years (healthy or with a chronic disease) were manually annotated and compared with the output of U-Sleep. The two sets of hypnograms were compared using F1 overlap scores, accuracy, Cohen's kappa, and correlation coefficients. A qualitative analysis of the most significant systematic differences between the manual and automated scoring was also performed.
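The agreement metrics named above can be illustrated with a minimal sketch. It assumes each hypnogram is a list of per-epoch stage labels of equal length; the function names, the label strings, and the ten-epoch toy data are hypothetical and are not taken from the study or from the U-Sleep software.

```python
from collections import Counter

def stage_f1(manual, auto, stage):
    """F1 overlap for one stage: 2*TP / (2*TP + FP + FN)."""
    pairs = list(zip(manual, auto))
    tp = sum(m == stage and a == stage for m, a in pairs)
    fp = sum(m != stage and a == stage for m, a in pairs)
    fn = sum(m == stage and a != stage for m, a in pairs)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else float("nan")

def accuracy(manual, auto):
    """Fraction of epochs on which the two scorings agree."""
    return sum(m == a for m, a in zip(manual, auto)) / len(manual)

def cohen_kappa(manual, auto):
    """Agreement corrected for chance: (p_o - p_e) / (1 - p_e)."""
    n = len(manual)
    p_o = accuracy(manual, auto)
    count_m, count_a = Counter(manual), Counter(auto)
    # Chance agreement from the marginal stage frequencies of each scorer.
    p_e = sum(count_m[s] * count_a[s] for s in set(count_m) | set(count_a)) / n**2
    return (p_o - p_e) / (1 - p_e)

# Made-up ten-epoch example, for illustration only (not study data).
manual = ["W", "W", "N1", "N2", "N2", "N2", "N3", "N3", "REM", "REM"]
auto   = ["W", "N1", "N1", "N2", "N2", "N3", "N3", "N3", "REM", "N2"]

acc = accuracy(manual, auto)
kappa = cohen_kappa(manual, auto)
f1_by_stage = {s: stage_f1(manual, auto, s) for s in ["W", "N1", "N2", "N3", "REM"]}
```

Kappa is lower than raw accuracy whenever some agreement would be expected by chance, which is why both are usually reported together.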
U-Sleep matched the manually scored hypnograms with an overall mean F1 score (predicted performance) of 0.75, an accuracy of 83.9%, and an overall kappa value of 0.77. Stage-wise, U-Sleep achieved an F1 score of 0.79 in wake, 0.40 in N1, 0.86 in N2, 0.84 in N3, and 0.86 in REM. The correlation between U-Sleep and the manual scorer was moderate to very strong in all sleep stages (r = 0.57-0.81).
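As a quick arithmetic check, the reported overall mean F1 is consistent with an unweighted average of the five stage-wise scores. This is an inference from the numbers in the abstract; the abstract itself does not state how the overall mean was computed, and the variable names below are illustrative.

```python
# Stage-wise F1 scores as reported in the abstract.
reported_f1 = {"Wake": 0.79, "N1": 0.40, "N2": 0.86, "N3": 0.84, "REM": 0.86}

# Unweighted (macro) average across stages; assumed, not confirmed by the source.
macro_f1 = sum(reported_f1.values()) / len(reported_f1)
print(round(macro_f1, 2))
```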
Overall, there is a high degree of agreement between manual and automatic scoring. This suggests that U-Sleep is a valid and effective method for identifying sleep stages based on normal PSGs in a pediatric population. The disagreement was within what is expected for interscorer variation. Further evaluation of AI sleep-scoring models should include analysis of outliers and pathological sleep staging, which is also a challenge in manual annotation.