School of Computer Science, South China Normal University, Guangzhou 510631, China.
School of Computer Science and Engineering, South China University of Technology, Guangzhou 510006, China.
Sensors (Basel). 2021 Dec 22;22(1):36. doi: 10.3390/s22010036.
Acoustic scene classification (ASC) aims to infer information about the environment from audio segments. Inter-class similarity is a significant issue in ASC because acoustic scenes with different labels may sound quite similar. In this paper, the similarity relations among scenes are correlated with the classification error. A method that constructs a class hierarchy from the classification error is then proposed and integrated into a multitask learning framework. Experiments show that the proposed multitask learning method improves ASC performance. On the TUT Acoustic Scene 2017 dataset, we obtain an ensemble fine-grained accuracy of 81.4%, which exceeds the state of the art. With multitask learning, the basic Convolutional Neural Network (CNN) model improves by about 2.0% to 3.5%, depending on the spectrogram used. The coarse category accuracies (for two to six super-classes) range from 77.0% to 96.2% with single models. On the revised version of the LITIS Rouen dataset, we achieve an ensemble fine-grained accuracy of 83.9%. The multitask learning models improve on their basic models by 1.6% to 1.8%. The coarse category accuracies range from 94.9% to 97.9% for two to six super-classes with single models.
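The idea of deriving super-classes from classification error can be illustrated with a minimal sketch. The abstract does not specify the construction procedure, so everything here is an assumption made for illustration: the function names (`confusion_similarity`, `build_super_classes`, `coarse_labels`), the use of symmetrised confusion rates as a similarity measure, the average-linkage agglomerative merging, and the toy 4-class confusion matrix. The resulting coarse labels could serve as auxiliary targets alongside the fine-grained labels in a multitask model.

```python
from itertools import combinations


def confusion_similarity(confusion):
    """Turn a confusion matrix (rows = true class) into a symmetric
    similarity: the mean of the two row-normalised confusion rates.
    Assumption: classes that are confused more often are more similar."""
    n = len(confusion)
    rates = []
    for row in confusion:
        total = sum(row)
        rates.append([c / total if total else 0.0 for c in row])
    return [
        [0.0 if i == j else 0.5 * (rates[i][j] + rates[j][i]) for j in range(n)]
        for i in range(n)
    ]


def build_super_classes(confusion, num_super_classes):
    """Greedily merge the two most-confused clusters (average linkage)
    until only num_super_classes clusters remain."""
    sim = confusion_similarity(confusion)
    clusters = [[i] for i in range(len(confusion))]

    def avg_sim(a, b):
        return sum(sim[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > num_super_classes:
        a, b = max(
            combinations(range(len(clusters)), 2),
            key=lambda p: avg_sim(clusters[p[0]], clusters[p[1]]),
        )
        merged = sorted(clusters[a] + clusters[b])
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return sorted(clusters)


def coarse_labels(clusters, num_classes):
    """Map each fine-grained class index to its super-class index."""
    mapping = [None] * num_classes
    for super_idx, members in enumerate(clusters):
        for fine_idx in members:
            mapping[fine_idx] = super_idx
    return mapping


# Hypothetical confusion matrix for four scene classes (not from the paper).
confusion = [
    [80, 15, 3, 2],
    [18, 75, 4, 3],
    [2, 3, 85, 10],
    [1, 4, 12, 83],
]
super_classes = build_super_classes(confusion, 2)
fine_to_coarse = coarse_labels(super_classes, len(confusion))
print(super_classes, fine_to_coarse)
```

In this toy example, classes 0 and 1 are confused with each other most often, as are classes 2 and 3, so they end up grouped into two super-classes. Requesting a different `num_super_classes` yields a coarser or finer grouping, in the spirit of the two to six super-classes reported above.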