Department of Neurology, Beth Israel Deaconess Medical Center, Boston, MA, USA.
McCance Center for Brain Health, Massachusetts General Hospital, Boston, MA, USA.
Sleep. 2024 Nov 8;47(11). doi: 10.1093/sleep/zsae202.
This study aimed to (1) improve sleep staging accuracy through transfer learning (TL), to achieve or exceed human inter-expert agreement and (2) introduce a scorability model to assess the quality and trustworthiness of automated sleep staging.
A deep neural network (base model) was trained on a large multi-site polysomnography (PSG) dataset from the United States. TL was used to calibrate the model to a reduced montage and limited samples from the Korean Genome and Epidemiology Study (KoGES) dataset. Model performance was compared to inter-expert reliability among three human experts. A scorability assessment was developed to predict the agreement between the model and human experts.
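Agreement between the model and a human scorer, or between two human scorers, can be illustrated with a chance-corrected agreement statistic. The sketch below assumes that the κ values reported in this abstract are Cohen's kappa computed over epoch-by-epoch stage labels; the abstract does not state the exact variant or how the three experts were combined. The function name `cohen_kappa` and the example label sequences are hypothetical and are not data from the study.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("label sequences must be non-empty and equal in length")
    n = len(labels_a)

    # Observed agreement: fraction of epochs on which both raters agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: product of each rater's marginal stage frequencies,
    # summed over stages (Counter returns 0 for stages a rater never used).
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    expected = sum(count_a[stage] * count_b[stage] for stage in count_a) / (n * n)

    if expected == 1.0:
        # Both raters used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical 8-epoch hypnograms (W, N1, N2, N3, R), for illustration only.
expert = ["W", "W", "N1", "N2", "N2", "N3", "R", "R"]
model = ["W", "N1", "N1", "N2", "N2", "N2", "R", "R"]
print(round(cohen_kappa(expert, model), 2))  # 0.68
```

With three experts, one plausible summary is the mean of the pairwise kappas, but that aggregation is an assumption here rather than the study's stated procedure.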
Initial sleep staging by the base model showed lower agreement with experts (κ = 0.55) than the experts showed with one another (inter-expert κ = 0.62). Calibration with 324 randomly sampled training cases brought model-expert agreement up to the inter-expert level. Further targeted sampling improved performance, with models exceeding inter-expert agreement (κ = 0.70). The scorability assessment, which combines biosignal quality and model confidence features, predicted model-expert agreement moderately well (R² = 0.42). For recordings with higher scorability scores, model-expert agreement exceeded inter-expert agreement. Even for recordings with lower scorability scores, model performance was comparable to inter-expert agreement.
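The R² figure for the scorability assessment can be read as the share of variance in per-recording model-expert agreement that the scorability prediction explains. The sketch below computes the standard coefficient of determination; the abstract does not specify the regression model or how R² was evaluated, so the function name `r_squared` and the per-recording values are hypothetical, chosen only to make the arithmetic easy to follow.

```python
def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(observed) != len(predicted) or len(observed) == 0:
        raise ValueError("inputs must be non-empty and equal in length")
    mean_observed = sum(observed) / len(observed)

    # Residual sum of squares: error left over after prediction.
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    # Total sum of squares: spread of the observed values around their mean.
    ss_tot = sum((o - mean_observed) ** 2 for o in observed)
    if ss_tot == 0:
        raise ValueError("observed values have zero variance")
    return 1.0 - ss_res / ss_tot


# Hypothetical per-recording values, not results from the study:
# observed model-expert kappa and the kappa predicted from scorability features.
observed_kappa = [0.5, 0.6, 0.7, 0.8]
predicted_kappa = [0.55, 0.6, 0.65, 0.8]
print(round(r_squared(observed_kappa, predicted_kappa), 2))  # 0.9
```

Under this reading, R² = 0.42 would mean the scorability features account for a bit less than half of the variation in model-expert agreement across recordings.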
Fine-tuning a pretrained neural network through targeted TL significantly enhances sleep staging performance for an atypical montage, achieving and surpassing human expert agreement levels. The introduction of a scorability assessment provides a robust measure of reliability, ensuring quality control and enhancing the practical application of the system before deployment. This approach marks an important advancement in automated sleep analysis, demonstrating the potential for AI to exceed human performance in clinical settings.