New York University, NY.
Gallaudet University, Washington, DC.
J Speech Lang Hear Res. 2023 May 9;66(5):1541-1562. doi: 10.1044/2023_JSLHR-22-00694. Epub 2023 Apr 14.
Limited research has examined the suitability of crowdsourced ratings to measure treatment effects in speakers with Parkinson's disease (PD), particularly for constructs such as voice quality. This study obtained measures of reliability and validity for crowdsourced listeners' ratings of voice quality in speech samples from a published study. We also investigated whether aggregated listener ratings would replicate the original study's findings of treatment effects based on the Acoustic Voice Quality Index (AVQI) measure.
This study reports a secondary outcome measure of a randomized controlled trial with speakers with dysarthria associated with PD, including two active comparators (Lee Silverman Voice Treatment [LSVT LOUD] and LSVT ARTIC), an inactive comparator (untreated PD), and a healthy control group. Speech samples from three time points (pretreatment, posttreatment, and 6-month follow-up) were presented in random order for rating as "typical" or "atypical" with respect to voice quality. Untrained listeners were recruited through the Amazon Mechanical Turk crowdsourcing platform until each sample had at least 25 ratings.
Intrarater reliability for tokens presented repeatedly was substantial (Cohen's κ = .65-.70), and interrater agreement significantly exceeded chance level. There was a significant correlation of moderate magnitude between the AVQI and the proportion of listeners classifying a given sample as "typical." Consistent with the original study, we found a significant interaction between group and time point, with the LSVT LOUD group alone showing significantly higher perceptually rated voice quality at posttreatment and follow-up relative to the pretreatment time point.
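For readers unfamiliar with the statistic, Cohen's κ measures agreement between two sets of ratings corrected for chance agreement. A minimal sketch follows; the labels and data are illustrative only, not drawn from the study:

```python
from collections import Counter

def cohens_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two passes of categorical ratings over the same items."""
    assert len(ratings_a) == len(ratings_b) and ratings_a
    n = len(ratings_a)
    # Observed agreement: fraction of items rated identically on both passes.
    p_o = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Chance agreement expected from each pass's marginal label frequencies.
    freq_a, freq_b = Counter(ratings_a), Counter(ratings_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / n**2
    return (p_o - p_e) / (1 - p_e)

# Hypothetical first and second ratings of the same six speech samples.
first_pass = ["typical", "atypical", "typical", "typical", "atypical", "atypical"]
second_pass = ["typical", "atypical", "typical", "atypical", "atypical", "atypical"]
print(round(cohens_kappa(first_pass, second_pass), 2))  # → 0.67
```

Here the two passes agree on 5 of 6 items (p_o ≈ .83) while chance agreement from the marginals is .50, giving κ = .67, which falls in the "substantial" band reported above.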
These results suggest that crowdsourcing can be a valid means to evaluate clinical speech samples, even for less familiar constructs such as voice quality. The findings also replicate the results of the study by Moya-Galé et al. (2022) and support their functional relevance by demonstrating that the effects of treatment measured acoustically in that study are perceptually apparent to everyday listeners.