K. Lisa Yang Center for Conservation Bioacoustics, Cornell Lab of Ornithology, Cornell University, Ithaca, NY, USA.
Marine Mammal Institute, Department of Fisheries, Wildlife, and Conservation Sciences, Oregon State University, Corvallis, OR, USA.
J R Soc Interface. 2021 Jul;18(180):20210297. doi: 10.1098/rsif.2021.0297. Epub 2021 Jul 21.
Many animals rely on long-form communication, in the form of songs, for vital functions such as mate attraction and territorial defence. We explored the prospect of improving automatic recognition performance by using the temporal context inherent in song. The ability to accurately detect sequences of calls has implications for conservation and biological studies. We show that the performance of a convolutional neural network (CNN), designed to detect song notes (calls) in short-duration audio segments, can be improved by combining it with a recurrent network designed to process sequences of learned representations from the CNN on a longer time scale. The combined system of independently trained CNN and long short-term memory (LSTM) network models exploits the temporal patterns between song notes. We demonstrate the technique using recordings of fin whale (Balaenoptera physalus) songs, which comprise patterned sequences of characteristic notes. We evaluated several variants of the CNN + LSTM network. Relative to the baseline CNN model, the CNN + LSTM models reduced performance variance, offering a 9-17% increase in area under the precision-recall curve and a 9-18% increase in peak F1-scores. These results show that the inclusion of temporal information may offer a valuable pathway for improving the automatic recognition and transcription of wildlife recordings.
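The two metrics reported above, area under the precision-recall curve and peak F1-score, can be illustrated with a minimal pure-Python sketch. It is not the authors' evaluation code: the function names, the toy scores and labels, and the step-wise (average-precision style) integration of the curve are all assumptions made for illustration, since the abstract does not say how the area was computed.

```python
from typing import List, Tuple


def precision_recall_points(scores: List[float], labels: List[int]) -> List[Tuple[float, float]]:
    """Sweep a detection threshold from high to low and return (recall, precision) pairs.

    scores: detector outputs, one per audio segment (hypothetical values).
    labels: 1 if the segment truly contains a call, else 0.
    """
    total_pos = sum(labels)
    if total_pos == 0:
        raise ValueError("need at least one positive label")
    # Indices of segments ordered from most to least confident.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    points = []
    tp = fp = 0
    for rank, i in enumerate(order):
        if labels[i] == 1:
            tp += 1
        else:
            fp += 1
        # Tied scores share one threshold, so emit a point only after the last of a tie.
        is_last = rank == len(order) - 1
        if is_last or scores[order[rank + 1]] != scores[i]:
            points.append((tp / total_pos, tp / (tp + fp)))
    return points


def area_under_pr(points: List[Tuple[float, float]]) -> float:
    """Step-wise area: each recall increment weighted by the precision at that point.

    Assumption: this is one common convention; other interpolations give other values.
    """
    area = 0.0
    prev_recall = 0.0
    for recall, precision in points:
        area += (recall - prev_recall) * precision
        prev_recall = recall
    return area


def peak_f1(points: List[Tuple[float, float]]) -> float:
    """Largest F1 = 2PR / (P + R) over all thresholds on the curve."""
    return max(2 * p * r / (p + r) for r, p in points if p + r > 0)


# Toy example (made-up numbers, not data from the study).
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
labels = [1, 1, 0, 1, 0, 0]
curve = precision_recall_points(scores, labels)
print(f"area under PR curve: {area_under_pr(curve):.4f}")
print(f"peak F1-score:       {peak_f1(curve):.4f}")
```

Running the same two functions on the scores of a baseline model and of a sequence-aware model over identical labels gives the kind of relative comparison the abstract describes, under the integration convention assumed here.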