Department of Telecommunications and Media Informatics, Budapest University of Technology and Economics, 1117 Budapest, Hungary.
IEEE J Transl Eng Health Med. 2023 Dec 7;12:233-244. doi: 10.1109/JTEHM.2023.3340345. eCollection 2024.
Besides serving as the primary medium of communication, speech carries valuable information about a speaker's health, emotions, and identity. Various medical conditions can affect the vocal organs and lead to speech difficulties. Voice clinicians and academic researchers have studied speech analysis extensively. Previous approaches, however, typically focused on a single task, such as differentiating normal from dysphonic speech, classifying different voice disorders, or estimating the severity of a voice disorder.
This study proposes an approach that combines transfer learning and multitask learning (MTL) to perform dysphonia classification and severity estimation simultaneously. Both tasks share a common representation, and each task-specific network is learned from these shared features. We employed five computer vision models and modified their architectures to support multitask learning. Additionally, we conducted binary ('healthy vs. dysphonia') and multiclass ('healthy vs. organic vs. functional dysphonia') classification with multitask learning, using the speaker's sex as an auxiliary task.
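The shared-representation design described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the linear "backbone" stands in for a pretrained computer-vision feature extractor, and all layer sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def shared_backbone(x, W):
    # Shared representation: one linear layer + ReLU, a stand-in for the
    # pretrained computer-vision model whose features all tasks reuse.
    return np.maximum(x @ W, 0.0)

# Hypothetical dimensions for flattened spectrogram inputs.
in_dim, feat_dim, n_classes = 4096, 128, 2
W_shared = rng.normal(size=(in_dim, feat_dim)) * 0.01
W_cls = rng.normal(size=(feat_dim, n_classes)) * 0.01  # dysphonia classification head
W_sev = rng.normal(size=(feat_dim, 1)) * 0.01          # severity-estimation head
W_sex = rng.normal(size=(feat_dim, 2)) * 0.01          # auxiliary sex-prediction head

x = rng.normal(size=(4, in_dim))      # batch of 4 inputs
z = shared_backbone(x, W_shared)      # features shared by all three tasks
cls_logits = z @ W_cls
sev_pred = z @ W_sev
sex_logits = z @ W_sex
print(cls_logits.shape, sev_pred.shape, sex_logits.shape)
```

Each head reads the same features `z`, so gradients from every task would shape the shared backbone during training, which is the core idea of hard-parameter-sharing MTL.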
The proposed method outperformed single-task learning (STL), which performs only classification or only severity estimation, across all classification metrics. Specifically, the model achieved F1 scores of 93% with MTL and 90% with STL. Moreover, by evaluating beta values that weight the sex-prediction auxiliary task, we observed considerable improvements in both classification tasks: MTL achieved an accuracy of 77% compared with 73.2% for STL. The severity-estimation performance of MTL, however, was comparable to that of STL.
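A common way to realize the beta-weighted auxiliary task mentioned above is a scalar-weighted sum of the per-task losses. The function below is a hypothetical sketch of such an objective (the paper's exact loss formulation is not given in the abstract); the loss values are made-up numbers for illustration.

```python
def multitask_loss(cls_loss, sev_loss, sex_loss, beta=0.1):
    # Main-task losses (classification + severity) plus a beta-weighted
    # auxiliary sex-prediction loss; beta controls how strongly the
    # auxiliary task shapes the shared representation.
    return cls_loss + sev_loss + beta * sex_loss

# With beta = 0 the auxiliary task is ignored; a larger beta gives
# the sex-prediction task more influence during training.
print(round(multitask_loss(0.8, 0.5, 0.4, beta=0.0), 3))   # 1.3
print(round(multitask_loss(0.8, 0.5, 0.4, beta=0.25), 3))  # 1.4
```

Sweeping beta, as the evaluation above suggests, lets one find the auxiliary-task weight that helps the main tasks most without letting the auxiliary objective dominate.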
Our goal is to help voice pathologists and clinicians better understand patients' conditions, track their progress more easily, and improve the monitoring of vocal quality and treatment procedures. Clinical and Translational Impact Statement: By integrating dysphonia classification and severity estimation through multitask learning, we aim to enable clinicians to gain a better understanding of a patient's condition and to monitor their progress and voice quality effectively.