Shennong: A Python toolbox for audio speech features extraction.

Affiliations

Cognitive Machine Learning, PSL Research University, CNRS, EHESS, ENS, Inria, Paris, France.

EconomiX (UMR 7235), Université Paris Nanterre, CNRS, Nanterre, France.

Publication information

Behav Res Methods. 2023 Dec;55(8):4489-4501. doi: 10.3758/s13428-022-02029-6. Epub 2023 Feb 7.

DOI:10.3758/s13428-022-02029-6

PMID:36750521

Abstract

We introduce Shennong, a Python toolbox and command-line utility for audio speech features extraction. It implements a wide range of well-established state-of-the-art algorithms: spectro-temporal filters such as Mel-Frequency Cepstral Filterbank or Predictive Linear Filters, pre-trained neural networks, pitch estimators, speaker normalization methods, and post-processing algorithms. Shennong is an open-source, reliable, and extensible framework built on top of the popular Kaldi speech processing library. Its Python implementation makes it easy for non-technical users to use and integrates with third-party speech modeling and machine learning tools from the Python ecosystem. This paper describes the Shennong software architecture, its core components, and the algorithms it implements. Three applications then illustrate its use. We first present a benchmark of the speech features extraction algorithms available in Shennong on a phone discrimination task. We then analyze the performance of a speaker normalization model as a function of the speech duration used for training. Finally, we compare pitch estimation algorithms on speech under various noise conditions.
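To give a sense of what a pitch estimator computes, here is a minimal sketch in plain Python. It is not Shennong's API and not one of the algorithms the toolbox implements: it is a basic autocorrelation method, chosen only because it is short. The function name `estimate_pitch`, the `fmin`/`fmax` search bounds, and the synthetic 200 Hz test tone are all assumptions made for illustration.

```python
import math


def estimate_pitch(samples, sample_rate, fmin=60.0, fmax=400.0):
    """Estimate the fundamental frequency (Hz) of one frame of audio.

    Illustrative autocorrelation method (hypothetical helper, not Shennong code):
    the lag that maximizes the raw autocorrelation is taken as the pitch period.
    Returns 0.0 when no positive correlation is found (e.g. silence).
    """
    # Restrict the search to lags matching the plausible pitch range.
    min_lag = int(sample_rate / fmax)
    max_lag = int(sample_rate / fmin)

    best_lag, best_score = 0, 0.0
    for lag in range(min_lag, max_lag + 1):
        n = len(samples) - lag
        if n <= 0:
            break
        # Raw (unnormalized) autocorrelation at this lag. Longer lags sum
        # fewer terms, which favors the shortest period over its multiples.
        score = sum(samples[i] * samples[i + lag] for i in range(n))
        if score > best_score:
            best_lag, best_score = lag, score

    return sample_rate / best_lag if best_lag else 0.0


# Synthetic input: 0.1 s of a pure 200 Hz tone sampled at 8 kHz.
sample_rate = 8000
tone = [math.sin(2 * math.pi * 200.0 * i / sample_rate) for i in range(800)]
print(estimate_pitch(tone, sample_rate))
```

On real speech, a method this simple is sensitive to noise and to octave errors, which is one reason comparing estimators under different noise conditions is informative.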
