Zhang Feiyu, Zhang Luyang, Chen Hongxiang, Xie Jiangjian
School of Technology, Beijing Forestry University, Beijing 100083, China.
Entropy (Basel). 2021 Nov 13;23(11):1507. doi: 10.3390/e23111507.
Deep convolutional neural networks (DCNNs) have achieved breakthrough performance on bird species identification using spectrograms of bird vocalizations. To address class imbalance in the bird vocalization dataset, a single-feature identification model (SFIM) with residual blocks and a modified weighted cross-entropy loss function was proposed. To further improve identification accuracy, two multi-channel fusion methods were built from three SFIMs: one fused the outputs of the feature extraction parts of the three SFIMs (feature fusion mode), while the other fused the outputs of their classifiers (result fusion mode). The SFIMs were trained on three different kinds of spectrograms, calculated through the short-time Fourier transform, mel-frequency cepstrum transform, and chirplet transform, respectively. To cope with the large number of trainable parameters in the multi-channel models, transfer learning was used. With our own vocalization dataset as the sample set, the result fusion mode model outperformed the other proposed models, with the best mean average precision (MAP) reaching 0.914. Comparing three spectrogram durations, 100 ms, 300 ms, and 500 ms, the results show that 300 ms is best for our dataset; we suggest determining the duration from the duration distribution of bird syllables. On the BirdCLEF2019 training dataset, the highest classification mean average precision (cmAP) reached 0.135, indicating that the proposed model has some generalization ability.
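The weighted cross-entropy idea behind the SFIM can be illustrated with a minimal sketch: per-class weights are set inversely proportional to class frequency so that rare species contribute more to the loss. The exact weighting scheme of the paper's modified loss is not given in the abstract, so the inverse-frequency normalization below is an assumption for illustration only.

```python
import math

def class_weights(counts):
    """Inverse-frequency class weights (assumed scheme, not the paper's exact one).

    counts: list of per-class sample counts.
    Returns weights proportional to 1/count, scaled so a balanced
    dataset would yield weight 1.0 for every class.
    """
    total = sum(counts)
    return [total / (len(counts) * c) for c in counts]

def weighted_cross_entropy(probs, label, weights):
    """Weighted cross-entropy for one sample: -w_y * log p_y."""
    return -weights[label] * math.log(probs[label])

# Example: a common class (100 samples) vs. a rare class (10 samples).
w = class_weights([100, 10])
common_loss = weighted_cross_entropy([0.5, 0.5], 0, w)
rare_loss = weighted_cross_entropy([0.5, 0.5], 1, w)
```

With equal predicted probabilities, the rare class incurs a larger loss, pushing the network not to ignore under-represented species.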
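The result fusion mode combines the classifier outputs of the three SFIMs (STFT, mel, and chirplet channels). The abstract does not specify the fusion rule, so the sketch below assumes the simplest choice, averaging the three posterior probability vectors; it stands in for whatever combination the paper actually uses.

```python
def result_fusion(prob_lists):
    """Fuse several classifiers' class-probability vectors by averaging.

    prob_lists: list of probability vectors, one per channel/SFIM
    (assumed fusion rule; the paper's exact rule is not stated in
    the abstract). Returns the element-wise mean vector.
    """
    n_models = len(prob_lists)
    n_classes = len(prob_lists[0])
    return [sum(p[k] for p in prob_lists) / n_models for k in range(n_classes)]

# Hypothetical two-class outputs from the STFT, mel, and chirplet channels:
fused = result_fusion([[0.7, 0.3], [0.5, 0.5], [0.6, 0.4]])
# fused -> [0.6, 0.4]
```

Averaging posteriors keeps the fused output a valid probability distribution and lets a channel that is confident on a given syllable dominate the final prediction.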