Scalable preprocessing of high volume environmental acoustic data for bioacoustic monitoring.

Affiliation

School of Technology, Environments and Design, University of Tasmania, Hobart, Tasmania, Australia.

Publication information

PLoS One. 2018 Aug 3;13(8):e0201542. doi: 10.1371/journal.pone.0201542. eCollection 2018.

DOI:10.1371/journal.pone.0201542

PMID:30075012

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC6075764/

Abstract

In this work, we examine the problem of efficiently preprocessing and denoising high volume environmental acoustic data, which is a necessary step in many bird monitoring tasks. Preprocessing is typically made up of multiple steps which are considered separately from each other. These are often resource intensive, particularly because the volume of data involved is high. We focus on addressing two challenges within this problem: how to combine existing preprocessing tasks while maximising the effectiveness of each step, and how to process this pipeline quickly and efficiently, so that it can be used to process high volumes of acoustic data. We describe a distributed system designed specifically for this problem, utilising a master-slave model with data parallelisation. By investigating the impact of individual preprocessing tasks on each other, and their execution times, we determine an efficient and accurate order for preprocessing tasks within the distributed system. We find that, using a single core, our pipeline executes 1.40 times faster compared to manually executing all preprocessing tasks. We then apply our pipeline in the distributed system and evaluate its performance. We find that our system is capable of preprocessing bird acoustic recordings at a rate of 174.8 seconds of audio per second of real time with 32 cores over 8 virtual machines, which is 21.76 times faster than a serial process.
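
The throughput and speedup figures above can be related with a little arithmetic. The sketch below is illustrative only: the function names are hypothetical, and it assumes the 21.76 times speedup is measured against a single-core run of the same pipeline, so the serial rate and the parallel efficiency it derives are implied values rather than numbers reported in the abstract.

```python
# Figures quoted in the abstract.
PARALLEL_RATE = 174.8   # seconds of audio processed per second of real time
SPEEDUP = 21.76         # speedup over the serial process
CORES = 32              # cores spread over 8 virtual machines


def implied_serial_rate(parallel_rate: float, speedup: float) -> float:
    """Serial throughput implied by the parallel rate and the speedup (assumption)."""
    return parallel_rate / speedup


def parallel_efficiency(speedup: float, cores: int) -> float:
    """Fraction of ideal linear speedup actually achieved."""
    return speedup / cores


def wall_clock_seconds(audio_seconds: float, rate: float) -> float:
    """Real time needed to preprocess a recording at a given throughput."""
    return audio_seconds / rate


serial_rate = implied_serial_rate(PARALLEL_RATE, SPEEDUP)
efficiency = parallel_efficiency(SPEEDUP, CORES)
one_day = 24 * 60 * 60  # a 24-hour recording, in seconds of audio

print(f"implied serial rate: {serial_rate:.2f} s of audio per second")
print(f"parallel efficiency: {efficiency:.2f}")
print(f"24 h of audio, 32 cores: {wall_clock_seconds(one_day, PARALLEL_RATE) / 60:.1f} min")
print(f"24 h of audio, serial:   {wall_clock_seconds(one_day, serial_rate) / 3600:.1f} h")
```

Under these assumptions, the serial process handles roughly 8 seconds of audio per second of real time, and the 32-core configuration reaches about 68% of ideal linear scaling.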
