Does synthetic data augmentation improve the performance of machine learning classifiers for identifying health problems in patient-nurse verbal communications in home healthcare settings?

AUTHORS

Scroggins Jihye Kim, Topaz Maxim, Song Jiyoun, Zolnoori Maryam

AFFILIATIONS

Columbia University School of Nursing, New York, New York, USA.

Data Science Institute, Columbia University, New York, New York, USA.

PUBLICATION INFORMATION

J Nurs Scholarsh. 2025 Jan;57(1):47-58. doi: 10.1111/jnu.13004. Epub 2024 Jul 3.

BACKGROUND

Identifying health problems in audio-recorded patient-nurse communication is important to improve outcomes in home healthcare patients who have complex conditions with increased risks of hospital utilization. Training machine learning classifiers for identifying problems requires resource-intensive human annotation.

OBJECTIVE

To generate synthetic patient-nurse communication using GPT-4 and automatically annotate it for common health problems encountered in home healthcare settings. We also examined whether augmenting real-world patient-nurse communication with synthetic data improves the performance of machine learning classifiers in identifying health problems.

DESIGN

Secondary data analysis of patient-nurse verbal communication data in home healthcare settings.

METHODS

The data were collected from one of the largest home healthcare organizations in the United States. We used 23 audio recordings of patient-nurse communications from 15 patients. The audio recordings were transcribed verbatim and manually annotated for health problems (e.g., circulation, skin, pain) indicated in the Omaha System Classification scheme. Synthetic data of patient-nurse communication were generated using the in-context learning prompting method, enhanced by chain-of-thought prompting to improve the automatic annotation performance. Machine learning classifiers were applied to three training datasets: real-world communication, synthetic communication, and real-world communication augmented by synthetic communication.
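The three training configurations named above can be pictured with a minimal sketch. It assumes each transcript segment is stored as a (text, label) pair; the function name `build_training_sets`, the dictionary keys, and the sample rows are hypothetical illustrations, not the authors' code or data.

```python
from typing import Dict, List, Tuple

# Each example is (utterance_text, health_problem_label).
# This representation is an assumption made for illustration.
Example = Tuple[str, str]


def build_training_sets(
    real: List[Example], synthetic: List[Example]
) -> Dict[str, List[Example]]:
    """Assemble the three training configurations: real, synthetic, and both."""
    return {
        "real_only": list(real),
        "synthetic_only": list(synthetic),
        "real_plus_synthetic": list(real) + list(synthetic),
    }


# Made-up placeholder rows, NOT data from the study.
real = [
    ("my ankle has been swelling", "circulation"),
    ("the wound is red", "skin"),
]
synthetic = [("I feel a sharp ache in my back", "pain")]

training_sets = build_training_sets(real, synthetic)
for name, rows in training_sets.items():
    print(name, len(rows))
```

Each configuration would then be passed to the same classifier and scored the same way, so that any difference in F1 reflects the training data rather than the model.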

RESULTS

Average F1 scores improved from 0.62 to 0.63 after training data were augmented with synthetic communication. The largest increase was observed with the XGBoost classifier, where F1 scores improved from 0.61 to 0.64 (about a 5% relative improvement). When trained solely on real-world communication or solely on synthetic communication, the classifiers showed comparable F1 scores of 0.62 and 0.61, respectively.
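The reported gains can be checked with a short worked calculation. The sketch below uses the standard F1 definition (the harmonic mean of precision and recall) and treats "about 5%" as a relative change, which is an assumption about how the figure was derived; the function names are hypothetical, and the abstract does not report the underlying precision and recall values.

```python
def f1_from_precision_recall(precision: float, recall: float) -> float:
    """Standard F1: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0  # convention when both are zero
    return 2 * precision * recall / (precision + recall)


def relative_improvement(before: float, after: float) -> float:
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100


# F1 scores as reported in the abstract.
xgb_gain = relative_improvement(0.61, 0.64)  # XGBoost
avg_gain = relative_improvement(0.62, 0.63)  # average across classifiers

print(f"XGBoost relative gain: {xgb_gain:.1f}%")
print(f"Average relative gain: {avg_gain:.1f}%")
```

Under this reading the XGBoost gain comes to roughly 4.9%, consistent with the "about 5%" reported, while the average gain is under 2%.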

CONCLUSION

Augmenting real-world data with synthetic data improved machine learning classifiers' ability to identify health problems in home healthcare, and classifiers trained on synthetic data alone performed comparably to those trained on real-world data alone, highlighting the potential of synthetic data in healthcare analytics.

CLINICAL RELEVANCE

This study demonstrates the clinical relevance of leveraging synthetic patient-nurse communication data to enhance the performance of machine learning classifiers that identify health problems in home healthcare settings, which will contribute to more accurate and efficient identification of health problems in home healthcare patients with complex conditions.
