Faculty of Medicine and University Hospital Carl Gustav Carus, TUD Dresden University of Technology, Dresden, Germany.
Institute of Biomedical Engineering, TUD Dresden University of Technology, Dresden, Germany.
Sci Rep. 2024 Jul 14;14(1):16239. doi: 10.1038/s41598-024-67022-9.
To support sleep laboratories with automatic arousal detection, we evaluated an optimized, state-of-the-art approach on data from routine practice in our university hospital sleep laboratory. A machine learning algorithm was trained and evaluated on 3423 polysomnograms of people with various sleep disorders. The model architecture is a U-net that accepts 50 Hz signals as input. We compared this algorithm with models trained on publicly available datasets and evaluated all models on our clinical dataset, particularly with regard to the effects of different sleep disorders. To assess clinical relevance, we designed a metric based on the error of the predicted arousal index. Our models achieve an area under the precision-recall curve (AUPRC) of up to 0.83 and F1 scores of up to 0.81. The model trained on our data showed no age or gender bias, and sleep disorders had no significant negative effect on its performance compared to healthy sleep. In contrast, models trained on public datasets showed a small to moderate negative effect of sleep disorders on model performance (effect size calculated using Cohen's d). We conclude that state-of-the-art arousal detection on our clinical data is possible with our model architecture, and our results support the general recommendation to use a clinical dataset for training if the model is to be applied to clinical data.
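The abstract does not define the arousal-index metric or the effect-size calculation in detail, so the following is only a minimal sketch of how such quantities might be computed. It assumes that the arousal index is the number of arousal events per hour of sleep, that predictions and annotations are available as per-sample binary masks, and that Cohen's d uses the common pooled-standard-deviation form. All function names (`count_events`, `arousal_index_error`, `cohens_d`) are hypothetical and not taken from the paper.

```python
import math
import statistics


def count_events(mask):
    """Count contiguous runs of truthy samples in a per-sample binary mask."""
    count = 0
    previous = False
    for sample in mask:
        current = bool(sample)
        # A new event starts where the mask switches from 0 to 1.
        if current and not previous:
            count += 1
        previous = current
    return count


def arousal_index(mask, sleep_hours):
    """Arousal events per hour of sleep (assumed definition)."""
    if sleep_hours <= 0:
        raise ValueError("sleep_hours must be positive")
    return count_events(mask) / sleep_hours


def arousal_index_error(pred_mask, true_mask, sleep_hours):
    """Signed error of the predicted arousal index for one recording."""
    return arousal_index(pred_mask, sleep_hours) - arousal_index(true_mask, sleep_hours)


def cohens_d(group_a, group_b):
    """Cohen's d with pooled sample standard deviation (assumed variant)."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = (
        (n_a - 1) * statistics.variance(group_a)
        + (n_b - 1) * statistics.variance(group_b)
    ) / (n_a + n_b - 2)
    return (statistics.mean(group_a) - statistics.mean(group_b)) / math.sqrt(pooled_var)
```

In this sketch, `arousal_index_error` would be evaluated once per polysomnogram, and `cohens_d` could then compare a per-recording performance score between recordings with a given sleep disorder and recordings of healthy sleep. By the usual convention, absolute values of d around 0.2 are read as small and around 0.5 as moderate.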