Kurland Jacquie, Liu Anna, Varadharaju Vishnupriya, Stokes Polly, Cavanaugh Rob
University of Massachusetts Amherst, Department of Speech, Language, and Hearing Sciences.
University of Massachusetts Amherst, Department of Mathematics and Statistics.
Aphasiology. 2025;39(3):363-384. doi: 10.1080/02687038.2024.2351029. Epub 2024 May 16.
BACKGROUND: While many measures exist for assessing discourse in aphasia, manual transcription, editing, and scoring are prohibitively labor-intensive, a major obstacle to their widespread use by clinicians (Bryant et al., 2017; Cruice et al., 2020). Many tools also lack rigorous psychometric evidence of reliability and validity (Azios et al., 2022; Carragher et al., 2023). Establishing test reliability is the first step in our long-term goal of automating the Brief Assessment of Transactional Success in aphasia (BATS; Kurland et al., 2021) and making it accessible to clinicians and clinical researchers.
AIMS: We evaluated multiple aspects of the test reliability of the BATS by examining correlations between human- and machine-edited transcripts and between transcripts edited by different human raters, correlations between raw and edited transcripts, interrater agreement on main concept scoring, and test-retest performance. We hypothesized that automated methods of transcription and discourse analysis would demonstrate sufficient reliability to move forward with test development.
METHODS & PROCEDURES: We examined 576 story retelling narratives from a sample of 24 persons with aphasia and familiar and unfamiliar conversation partners (CP). Participants with aphasia (PWA) retold stories immediately after watching/listening to short video/audio clips. CP retold stories after six-minute topic-constrained conversations with a PWA in which the dyad co-constructed the stories. We utilized two macrostructural measures to analyze the automated speech-to-text transcripts of story retells: 1) a modified version of a semi-automated tool for measuring main concepts (mainConcept: Cavanaugh et al., 2021); and 2) an automated natural language processing 'pipeline' to assess topic similarity.
OUTCOMES & RESULTS: Correlations between raw and edited scores were excellent, and interrater reliability for transcript editing and main concept scoring was acceptable. Test-retest reliability on repeated stimuli was also acceptable, particularly for story retellings by PWA, where stimuli were truly repeated within subjects.
CONCLUSIONS: Results suggest that automated speech-to-text transcription was sufficient in most cases to avoid the time-consuming, labor-intensive step of manually transcribing and editing discourse. Overall, our results suggest that automated natural language processing methods such as text vectorization and cosine similarity offer a fast, efficient way to obtain a measure of topic similarity between two discourse samples. Although test-retest reliability for the semi-automated mainConcept method was generally higher than for the automated topic similarity measures, we found no evidence of a difference between machine-automated and human-reliant scoring.
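To illustrate the kind of computation referred to above, the sketch below compares two short retell transcripts using text vectorization and cosine similarity. It is a minimal illustration under assumptions, not the BATS pipeline itself: the scikit-learn library, the TF-IDF weighting scheme, and the example transcripts are choices made here for demonstration only.

```python
# Minimal sketch of topic similarity via text vectorization and cosine
# similarity. Illustrative only; the BATS pipeline's actual preprocessing
# and vectorization approach may differ.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def topic_similarity(transcript_a: str, transcript_b: str) -> float:
    """Return the cosine similarity between TF-IDF vectors of two transcripts."""
    vectorizer = TfidfVectorizer(stop_words="english")
    tfidf = vectorizer.fit_transform([transcript_a, transcript_b])
    # cosine_similarity over the 2-document matrix yields a 2x2 matrix;
    # the off-diagonal entry is the similarity between the two documents.
    return float(cosine_similarity(tfidf)[0, 1])


# Hypothetical example: a PWA retell vs. a conversation partner retell
pwa_retell = "The man bought flowers and gave them to his neighbor."
cp_retell = "He got flowers at the store and brought them to the woman next door."
print(round(topic_similarity(pwa_retell, cp_retell), 3))
```

With nonnegative TF-IDF vectors, the resulting score falls between 0 and 1, with higher values indicating greater lexical and topical overlap between the two samples.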