Voice Synthesis Improvement by Machine Learning of Natural Prosody.

Affiliations

Cyber Security Cooperative Research Centre, Edith Cowan University, 270 Joondalup Drive, Joondalup, WA 6027, Australia.

Security Research Institute, Edith Cowan University, Joondalup, WA 6027, Australia.

Publication information

Sensors (Basel). 2024 Mar 1;24(5):1624. doi: 10.3390/s24051624.

DOI:10.3390/s24051624

PMID:38475158

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC10934073/

Abstract

Since the advent of modern computing, researchers have striven to make the human-computer interface (HCI) as seamless as possible. Progress has been made on various fronts, e.g., the desktop metaphor (interface design) and natural language processing (input). One area receiving attention recently is voice activation and its corollary, computer-generated speech. Despite decades of research and development, most computer-generated voices remain easily identifiable as non-human. Prosody in speech has two primary components, intonation and rhythm, and both are often lacking in computer-generated voices. This research aims to enhance computer-generated text-to-speech algorithms by incorporating melodic and prosodic elements of human speech. The study explores a novel approach to adding prosody: using machine learning, specifically an LSTM neural network, to add paralinguistic elements to a recorded or generated voice. The aim is to increase the realism of computer-generated text-to-speech algorithms, to enhance electronic reading applications, and to improve artificial voices for those who need artificial assistance to speak. A computer that can also convey meaning through a spoken announcement will likewise improve human-to-computer interaction. Applications of such an algorithm may include improving high-definition audio codecs for telephony, renewing old recordings, and lowering barriers to the utilization of computing. This research deployed a prototype modular platform for digital speech improvement, analyzing and generalizing algorithms into a modular system and using laboratory experiments to optimize combinations and performance in edge cases. The results were encouraging, with the LSTM-based encoder able to produce realistic speech. Further work will involve optimizing the algorithm and comparing its performance against other approaches.
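The abstract does not describe the network in detail, so the following is only a minimal sketch of the general idea: an LSTM reads a sequence of per-frame speech features and emits one prosody adjustment per frame. Everything specific here is an assumption made for illustration and is not taken from the paper: the two-value feature frames (normalized pitch and energy), the helper names `init_params`, `lstm_step`, and `predict_pitch_offsets`, the hidden size, and the random untrained weights. It shows the data flow through a standard LSTM cell and nothing about the authors' actual model or training.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def init_params(input_size, hidden_size, seed=0):
    """Random, untrained weights (hypothetical; a real model would be trained)."""
    rng = random.Random(seed)
    width = input_size + hidden_size
    params = {}
    for gate in "ifog":  # input, forget, output, candidate
        params["W" + gate] = [
            [rng.uniform(-0.1, 0.1) for _ in range(width)]
            for _ in range(hidden_size)
        ]
        params["b" + gate] = [0.0] * hidden_size
    # Linear read-out from the hidden state to one scalar per frame.
    params["w_out"] = [rng.uniform(-0.1, 0.1) for _ in range(hidden_size)]
    return params


def lstm_step(x, h, c, params):
    """One standard LSTM time step; returns the new hidden and cell states."""
    z = list(x) + list(h)  # concatenate current input and previous hidden state

    def gate(name, activation):
        pre = matvec(params["W" + name], z)
        return [activation(p + b) for p, b in zip(pre, params["b" + name])]

    i = gate("i", sigmoid)    # how much new information to write
    f = gate("f", sigmoid)    # how much of the old cell state to keep
    o = gate("o", sigmoid)    # how much of the cell state to expose
    g = gate("g", math.tanh)  # candidate cell update
    c_new = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c, i, g)]
    h_new = [oj * math.tanh(cj) for oj, cj in zip(o, c_new)]
    return h_new, c_new


def predict_pitch_offsets(frames, params):
    """Run the LSTM over all frames and emit one pitch offset per frame."""
    hidden_size = len(params["w_out"])
    h = [0.0] * hidden_size
    c = [0.0] * hidden_size
    offsets = []
    for x in frames:
        h, c = lstm_step(x, h, c, params)
        offsets.append(sum(w * hj for w, hj in zip(params["w_out"], h)))
    return offsets


if __name__ == "__main__":
    # Hypothetical frames: [normalized pitch, normalized energy] of flat speech.
    flat_speech = [[0.5, 0.2], [0.5, 0.8], [0.5, 0.4]]
    params = init_params(input_size=2, hidden_size=4)
    print(predict_pitch_offsets(flat_speech, params))
```

Because the cell state carries information from earlier frames, each offset can depend on what came before it, which is the property that makes a recurrent model a plausible fit for intonation and rhythm. With untrained weights the printed offsets are meaningless; in a trained system they would be applied to the pitch contour of the recorded or generated voice.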
