Crossing the 'Cookie Theft' Corpus Chasm: Applying what BERT Learns from Outside Data to the ADReSS Challenge Dementia Detection Task.

Authors

Guo Yue, Li Changye, Roan Carol, Pakhomov Serguei, Cohen Trevor

Affiliations

Department of Biomedical Informatics and Medical Education, University of Washington, Seattle, Washington, USA.

Pharmaceutical Care and Health Systems, University of Minnesota, Minneapolis, Minnesota, USA.

Publication

Front Comput Sci. 2021 Apr;3. doi: 10.3389/fcomp.2021.642517. Epub 2021 Apr 16.

Abstract

Large amounts of labeled data are a prerequisite to training accurate and reliable machine learning models. However, in the medical domain in particular, this is also a stumbling block as accurately labeled data are hard to obtain. DementiaBank, a publicly available corpus of spontaneous speech samples from a picture description task widely used to study Alzheimer's disease (AD) patients' language characteristics and for training classification models to distinguish patients with AD from healthy controls, is relatively small - a limitation that is further exacerbated when restricting to the balanced subset used in the Alzheimer's Dementia Recognition through Spontaneous Speech (ADReSS) challenge. We build on previous work showing that the performance of traditional machine learning models on DementiaBank can be improved by the addition of normative data from other sources, evaluating the utility of such extrinsic data to further improve the performance of state-of-the-art deep learning based methods on the ADReSS challenge dementia detection task. To this end, we developed a new corpus of professionally transcribed recordings from the Wisconsin Longitudinal Study (WLS), resulting in 1366 additional Cookie Theft Task transcripts, increasing the available training data by an order of magnitude. Using these data in conjunction with DementiaBank is challenging because the WLS metadata corresponding to these transcripts do not contain dementia diagnoses. However, the cognitive status of WLS participants can be inferred from the results of several cognitive tests available in the WLS data, including semantic verbal fluency. In this work, we evaluate the utility of using the entire WLS corpus as normative data, as well as selecting normative data based on the inferred cognitive status, for training deep learning models to discriminate between language produced by patients with dementia and healthy controls. We find that incorporating WLS data when training a BERT model on ADReSS data improves its performance on the ADReSS dementia detection task, supporting the hypothesis that incorporating WLS data adds value in this context. We also demonstrate that weighted cost functions and additional prediction targets may be effective ways to address issues arising from class imbalance and confounding effects due to data provenance.
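
The final sentence of the abstract names two concrete mitigations: a weighted cost function for class imbalance and an additional prediction target to counter confounding from data provenance. Below is a minimal illustrative sketch of how these two ideas are commonly combined when fine-tuning BERT, assuming a PyTorch / Hugging Face Transformers setup; the checkpoint name (bert-base-uncased), the 3:1 class weights, and the 0.5 auxiliary-loss weight are assumptions for illustration, not values reported in the paper.

```python
# Illustrative sketch (not the authors' released code): a BERT classifier with a
# main dementia head trained with weighted cross-entropy (to offset the class
# imbalance introduced by adding the large WLS normative corpus) and an
# auxiliary head that predicts data provenance (ADReSS vs. WLS), making the
# corpus of origin an explicit target rather than a hidden confounder.

import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizerFast


class DementiaClassifier(nn.Module):
    """BERT encoder with a dementia head and an auxiliary provenance head."""

    def __init__(self, model_name: str = "bert-base-uncased"):
        super().__init__()
        self.encoder = BertModel.from_pretrained(model_name)
        hidden = self.encoder.config.hidden_size
        self.dementia_head = nn.Linear(hidden, 2)    # control vs. dementia
        self.provenance_head = nn.Linear(hidden, 2)  # ADReSS vs. WLS

    def forward(self, input_ids, attention_mask):
        pooled = self.encoder(input_ids=input_ids,
                              attention_mask=attention_mask).pooler_output
        return self.dementia_head(pooled), self.provenance_head(pooled)


def training_loss(model, batch, class_weights, aux_weight=0.5):
    """Weighted main loss plus an auxiliary provenance-prediction loss."""
    dementia_logits, provenance_logits = model(batch["input_ids"],
                                               batch["attention_mask"])
    # Weighted cross-entropy: up-weight the minority (dementia) class,
    # which becomes rarer once WLS normative transcripts are added.
    main_loss = nn.functional.cross_entropy(dementia_logits,
                                            batch["dementia_label"],
                                            weight=class_weights)
    # Auxiliary target: which corpus did the transcript come from?
    aux_loss = nn.functional.cross_entropy(provenance_logits,
                                           batch["provenance_label"])
    return main_loss + aux_weight * aux_loss


if __name__ == "__main__":
    tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
    model = DementiaClassifier()
    enc = tokenizer(["the boy is on the stool reaching for the cookie jar"],
                    return_tensors="pt", padding=True, truncation=True)
    batch = {
        "input_ids": enc["input_ids"],
        "attention_mask": enc["attention_mask"],
        "dementia_label": torch.tensor([0]),    # 0 = control, 1 = dementia
        "provenance_label": torch.tensor([1]),  # 0 = ADReSS, 1 = WLS
    }
    # Example class weights: minority (dementia) class up-weighted 3:1.
    loss = training_loss(model, batch, class_weights=torch.tensor([1.0, 3.0]))
    loss.backward()
    print(f"loss: {loss.item():.4f}")
```

In a setup like this, the auxiliary loss encourages the shared encoder to represent provenance explicitly, so the dementia head is less free to exploit corpus-specific transcription artifacts as a shortcut for the diagnostic label.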

