IEEE J Biomed Health Inform. 2023 Jun;27(6):2751-2759. doi: 10.1109/JBHI.2022.3203345. Epub 2023 Jun 6.
Given that depression is one of the most prevalent mental illnesses, developing effective and unobtrusive diagnosis tools is of great importance. Recent work that screens for depression with text messages leverages models relying on lexical category features. Given the colloquial nature of text messages, the performance of these models may be limited by formal lexicons. We thus propose a strategy to automatically construct alternative lexicons that contain more relevant and colloquial terms. Specifically, we generate 36 lexicons from fiction, forum, and news corpora. These lexicons are then used to extract lexical category features from the text messages. We utilize machine learning models to compare the depression screening capabilities of these lexical category features. Out of our 36 constructed lexicons, 14 achieved statistically significantly higher average F1 scores than both the pre-existing formal lexicon and a basic bag-of-words approach. In comparison to the pre-existing lexicon, our best performing lexicon increased the average F1 score by 10%. We thus confirm our hypothesis that less formal lexicons can improve the performance of classification models that screen for depression with text messages. By providing our automatically constructed lexicons, we aid future machine learning research that leverages less formal text.
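To make the notion of lexical category features concrete, the following is a minimal sketch of how a lexicon might be applied to a set of text messages. It is illustrative only and is not the authors' implementation: the toy lexicon, its category names, the regex tokenizer, and the function `lexical_category_features` are all assumptions introduced here, and the normalization (fraction of tokens per category) is one common choice rather than the one confirmed by the abstract.

```python
import re
from collections import Counter

# Hypothetical toy lexicon (category name -> set of terms); not from the paper.
TOY_LEXICON = {
    "sadness": {"sad", "cry", "lonely", "ugh"},
    "sleep": {"tired", "sleep", "awake", "nap"},
}


def tokenize(text):
    """Lowercase the text and split it into word tokens (a simplifying assumption)."""
    return re.findall(r"[a-z']+", text.lower())


def lexical_category_features(messages, lexicon):
    """Return, for each category, the fraction of all tokens that fall in it."""
    tokens = [tok for msg in messages for tok in tokenize(msg)]
    counts = Counter()
    for tok in tokens:
        for category, terms in lexicon.items():
            if tok in terms:
                counts[category] += 1
    total = len(tokens) or 1  # avoid division by zero when there are no tokens
    return {category: counts[category] / total for category in lexicon}


# Example: one participant's messages become one feature vector.
messages = ["ugh so tired", "cant sleep again"]
features = lexical_category_features(messages, TOY_LEXICON)
```

Under these assumptions, each participant's messages yield one fixed-length feature vector (one value per lexicon category), which could then be passed to any standard classifier. Swapping in a different lexicon changes only the `lexicon` argument, which is what makes comparing many lexicons straightforward.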