Hasan Md Abid, Li Frédéric, Gouverneur Philip, Piet Artur, Grzegorzek Marcin
German Research Center for Artificial Intelligence (DFKI), Lübeck, Germany.
Institute of Medical Informatics, University of Lübeck, Lübeck, Germany.
PLoS One. 2025 Mar 18;20(3):e0315343. doi: 10.1371/journal.pone.0315343. eCollection 2025.
Recent advances in hardware technology have made wearable sensors increasingly popular and ubiquitous, opening up new applications in the medical domain. This proliferation has notably increased the availability of Time Series (TS) data characterizing patients' behavioral or physiological information, prompting initiatives to leverage machine learning and data analysis techniques. Nonetheless, the complexity and time required to collect data remain significant hurdles that limit dataset sizes and hinder the effectiveness of machine learning. Data Augmentation (DA) stands out as a prime solution: it generates synthetic data to address the challenges of acquiring medical data. DA has been shown to consistently improve performance when images are involved. As a result, DA has also been investigated for TS, in particular for TS classification. However, the current state of DA in TS classification faces several challenges: methodological taxonomies restricted to the univariate case, insufficient guidance for selecting suitable DA methods, and a lack of conclusive evidence regarding the amount of synthetic data required to attain optimal outcomes. This paper presents a comprehensive survey of DA techniques for TS, together with experiments on their application to TS classification. We propose an updated taxonomy spanning three families of Time Series Data Augmentation (TSDA): Random Transformation (RT), Pattern Mixing (PM), and Generative Models (GM). Additionally, we empirically evaluate 12 TSDA methods on diverse datasets used in medical-related applications: OPPORTUNITY and HAR for Human Activity Recognition, DEAP for emotion recognition, and the BioVid Heat Pain Database (BVDB) and PainMonit Database (PMDB) for pain recognition.
Through this experimental analysis, we identify the best-performing DA techniques and provide recommendations for researchers on generating synthetic data to maximize the benefit of DA methods. Our findings show that, despite their simplicity, DA methods of the RT family are the most consistent in improving performance compared to using no augmentation.
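To make the RT family more concrete, the following is a minimal sketch of two random transformations on a multivariate time series. The abstract does not name the 12 evaluated methods, so the choice of jittering (additive Gaussian noise) and scaling (one random multiplicative factor per channel) is an assumption based on transformations commonly described in the TSDA literature. The function names, the noise levels `0.05` and `0.1`, and the `copies` parameter are hypothetical and are not taken from the paper.

```python
import random
from typing import List, Sequence, Tuple

# One example is a list of time steps; each time step is a list of channel values.
Series = List[List[float]]


def jitter(series: Series, sigma: float, rng: random.Random) -> Series:
    """Add independent zero-mean Gaussian noise to every sample."""
    return [[x + rng.gauss(0.0, sigma) for x in row] for row in series]


def scale(series: Series, sigma: float, rng: random.Random) -> Series:
    """Multiply each channel by a single random factor drawn around 1.0."""
    n_channels = len(series[0])
    factors = [rng.gauss(1.0, sigma) for _ in range(n_channels)]
    return [[x * f for x, f in zip(row, factors)] for row in series]


def augment(
    dataset: Sequence[Series],
    labels: Sequence[int],
    copies: int,
    rng: random.Random,
) -> Tuple[List[Series], List[int]]:
    """Return the original examples plus `copies` synthetic variants of each.

    Each synthetic example keeps the label of the example it was derived from.
    The noise levels below are illustrative assumptions, not tuned values.
    """
    out_x: List[Series] = list(dataset)
    out_y: List[int] = list(labels)
    for x, y in zip(dataset, labels):
        for _ in range(copies):
            out_x.append(scale(jitter(x, 0.05, rng), 0.1, rng))
            out_y.append(y)
    return out_x, out_y


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so runs are repeatable
    example = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # 3 time steps, 2 channels
    xs, ys = augment([example], [1], copies=3, rng=rng)
    print(len(xs), ys)
```

Here `copies` controls how much synthetic data is generated per original example, which is the quantity the abstract identifies as lacking conclusive guidance. In practice it would be treated as a hyperparameter and selected on validation data.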