Wu Yuqi, Mao Kaining, Zhang Yanbo, Chen Jie
IEEE J Biomed Health Inform. 2024 Dec;28(12):7531-7542. doi: 10.1109/JBHI.2024.3435085. Epub 2024 Dec 5.
The global prevalence of mental health disorders is increasing, leading to a significant economic burden estimated in trillions of dollars. In automated mental health diagnosis, the scarcity and imbalance of clinical data pose considerable challenges for researchers and limit the effectiveness of machine learning algorithms. To address this issue, this paper introduces a novel clinical transcript data augmentation framework that leverages large language models (CALLM). The framework follows a "patient-doctor role-playing" intuition to generate realistic synthetic data. In addition, our study introduces a "Textbook-Assignment-Application" (T-A-A) partitioning approach that offers a systematic means of crafting synthetic clinical interview datasets. We have also developed a "Response-Reason" prompt engineering paradigm to generate highly authentic and diagnostically valuable transcripts. With a fine-tuned DistilBERT model on the E-DAIC PTSD dataset, we achieved a balanced accuracy of 0.77, an F1-score of 0.70, and an AUC of 0.78 in test set evaluations, showing robust adaptability in both Zero-Shot Learning (ZSL) and Few-Shot Learning (FSL) scenarios. We further compare the CALLM framework with other data augmentation methods and PTSD diagnostic works and demonstrate consistent improvements. Compared to conventional data collection methods, our synthetic dataset not only demonstrates superior performance but also incurs less than 1% of the associated costs.
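For readers unfamiliar with the reported metrics, the following is a minimal sketch of how balanced accuracy and F1-score are computed from binary confusion-matrix counts, using the standard definitions. The function names and the counts are hypothetical illustrations and are not taken from the paper or from the E-DAIC dataset. AUC is omitted because it requires ranked prediction scores rather than counts.

```python
def balanced_accuracy(tp: int, fn: int, tn: int, fp: int) -> float:
    """Mean of sensitivity (recall on positives) and specificity (recall on negatives)."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return (sensitivity + specificity) / 2


def f1_score(tp: int, fn: int, fp: int) -> float:
    """Harmonic mean of precision and recall for the positive class."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical, imbalanced test set: 10 positive and 40 negative cases.
tp, fn, tn, fp = 8, 2, 30, 10

plain_accuracy = (tp + tn) / (tp + fn + tn + fp)
print(f"accuracy          = {plain_accuracy:.3f}")
print(f"balanced accuracy = {balanced_accuracy(tp, fn, tn, fp):.3f}")
print(f"F1-score          = {f1_score(tp, fn, fp):.3f}")
```

Balanced accuracy weights both classes equally, which is why it is commonly preferred over plain accuracy when positive cases are scarce, as in the imbalanced clinical data the abstract describes.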