Simon Shane, Silverstein Einav, Timmons-Sund Lauren, Pinto Jeremy M, Castro Eugenia M, O'Dell Karla, Johns III Michael M, Mack Wendy J, Bensoussan Yael
Department of Otolaryngology-Head & Neck Surgery, Keck School of Medicine, University of Southern California, Los Angeles, California.
Caruso Department of Otolaryngology, Head and Neck Surgery, Keck School of Medicine, University of Southern California, Los Angeles, California.
J Voice. 2023 Dec 29. doi: 10.1016/j.jvoice.2023.12.008.
There is currently a lack of objective treatment outcome measures for transgender individuals undergoing gender-affirming voice care. Recently, Bensoussan et al. developed an AI model that generates a voice femininity rating from a short voice sample submitted through a smartphone application. The purpose of this study was to examine the feasibility of using this model as a treatment outcome measure by comparing its performance to that of human listeners. Additionally, we examined the effect of two different training datasets on the model's accuracy and performance when presented with external data.
One hundred voice recordings from 50 cisgender males and 50 cisgender females were retrospectively collected from patients presenting at a university voice clinic for reasons other than dysphonia. The recordings were evaluated by expert and naïve human listeners, who rated each voice according to how certain they were that it belonged to a female speaker (% voice femininity [R]). Human ratings were compared to ratings generated by (1) an AI model trained on a high-quality, low-quantity dataset (voices from the Perceptual Voice Quality Database; PVQD model) and (2) an AI model trained on a low-quality, high-quantity dataset (voices from the Mozilla Common Voice database; Mozilla model). Ambiguity scores were calculated as the absolute difference between the rating and full certainty (0% or 100%).
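The ambiguity score described above can be sketched in a few lines. This is an illustrative reading of the abstract's definition, not the authors' implementation; the assumption that the relevant certainty endpoint is whichever of 0% or 100% lies nearer the rating is ours.

```python
def ambiguity(rating: float) -> float:
    """Distance from a % voice femininity rating to the nearer
    certainty endpoint (0% or 100%).

    A rating of 50% is maximally ambiguous (score 50); a rating of
    0% or 100% is fully certain (score 0). Choosing the *nearer*
    endpoint is an assumption based on the abstract's wording.
    """
    return min(abs(rating - 0.0), abs(rating - 100.0))

# A rating of 90% femininity is close to certain (ambiguity 10);
# a rating of 50% is maximally ambiguous (ambiguity 50).
print(ambiguity(90.0))  # 10.0
print(ambiguity(50.0))  # 50.0
```

Under this reading, lower mean ambiguity indicates a model that commits more decisively to one gender, which is how the abstract compares the Mozilla and PVQD models.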
Both expert and naïve listeners achieved 100% accuracy in identifying voice gender based on a binary classification (female >50% voice femininity [R]). In comparison, the Mozilla-trained model achieved 92% accuracy and the previously published PVQD model achieved 84% accuracy in determining voice gender (female >50% AI voice femininity). While both AI models correlated with human ratings, the Mozilla-trained model showed a stronger correlation and lower overall rating ambiguity than the PVQD-trained model. The Mozilla model also appeared to handle pitch information similarly to human raters.
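The binary classification rule and accuracy figures above follow from a simple threshold. As a minimal sketch (the ratings and labels below are hypothetical examples, not the study's data):

```python
def classify_female(rating: float) -> bool:
    # Binary rule stated in the abstract: classify as female
    # when the voice femininity rating exceeds 50%.
    return rating > 50.0

def accuracy(ratings: list[float], is_female: list[bool]) -> float:
    # Fraction of recordings where the thresholded rating
    # matches the speaker's known gender.
    correct = sum(classify_female(r) == y
                  for r, y in zip(ratings, is_female))
    return correct / len(is_female)

# Hypothetical ratings for four speakers (illustration only).
ratings = [80.0, 30.0, 95.0, 45.0]
is_female = [True, False, True, True]
print(accuracy(ratings, is_female))  # 0.75
```

Applied to the study's 100 recordings, this is the metric on which human listeners scored 100%, the Mozilla model 92%, and the PVQD model 84%.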
The AI model predicted voice gender with high accuracy compared to human listeners and has potential as a useful outcome measure for transgender individuals receiving gender-affirming voice training. The Mozilla-trained model outperformed the PVQD-trained model, indicating that, for binary classification tasks, the quantity of training data may influence accuracy more than its quality.