Embedded Systems Laboratory (ESL), EPFL, Lausanne, 1015, Switzerland.
Sci Data. 2021 Jun 23;8(1):156. doi: 10.1038/s41597-021-00937-4.
Cough audio signal classification has been successfully used to diagnose a variety of respiratory conditions, and there has been significant interest in leveraging Machine Learning (ML) to provide widespread COVID-19 screening. The COUGHVID dataset provides over 25,000 crowdsourced cough recordings representing a wide range of participant ages, genders, geographic locations, and COVID-19 statuses. First, we contribute our open-sourced cough detection algorithm to the research community to assist in data robustness assessment. Second, four experienced physicians labeled more than 2,800 recordings to diagnose medical abnormalities present in the coughs, thereby contributing one of the largest expert-labeled cough datasets in existence that can be used for a plethora of cough audio classification tasks. Finally, we ensured that coughs labeled as symptomatic and COVID-19 originate from countries with high infection rates. As a result, the COUGHVID dataset contributes a wealth of cough recordings for training ML models to address the world's most urgent health crises.