Research and Innovation, ATOS Spain SA, 48013 Bilbao, Spain.
DeustoTech, University of Deusto, 48007 Bilbao, Spain.
Sensors (Basel). 2021 Sep 10;21(18):6063. doi: 10.3390/s21186063.
Data scientists spend much of their time on data cleaning tasks, and this is especially important when dealing with data gathered from sensors, where failures are not unusual (there is an abundance of research on anomaly detection in sensor data). This work analyzes several aspects of the data generated by different sensor types to understand the particularities of the data and link them with existing data mining methodologies. Using data from different sources, it analyzes how the type of sensor used and its measurement units have an important impact on basic statistics such as variance and mean, because of the statistical distributions of the datasets. The work also analyzes the behavior of outliers, how to detect them, and how they affect the equivalence of sensors, since equivalence is used in many solutions for identifying anomalies. Based on these results, the article presents guidance on how to handle data coming from sensors in order to understand the characteristics of sensor datasets, and proposes a parallelized implementation. Finally, the article shows that the proposed decision-making processes work well with a new type of sensor and that parallelizing across several cores enables calculations to be executed up to four times faster.
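As a rough illustration of the kind of per-sensor analysis the abstract describes, the following minimal sketch computes the mean and variance of each sensor's readings and flags outliers. It is not the article's implementation: the abstract does not name an outlier detection method, so the interquartile-range (Tukey fence) rule is assumed here, and the function names `summarize` and `summarize_all` and the `map_fn` parameter are hypothetical.

```python
import statistics
from typing import Callable, Dict, List, Tuple


def summarize(item: Tuple[str, List[float]]) -> Tuple[str, dict]:
    """Mean, sample variance, and IQR-based outliers for one sensor."""
    name, values = item
    # Quartile cut points; the IQR rule is an assumption, not the paper's method.
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr  # Tukey fences
    return name, {
        "mean": statistics.fmean(values),
        "variance": statistics.variance(values),
        "outliers": [v for v in values if v < low or v > high],
    }


def summarize_all(sensors: Dict[str, List[float]],
                  map_fn: Callable = map) -> Dict[str, dict]:
    """Summarize every sensor; map_fn may be replaced by a parallel map."""
    return dict(map_fn(summarize, sensors.items()))
```

Each sensor is summarized independently, so the work can be spread across cores by passing a parallel map in place of the built-in one, for example the `map` method of a `concurrent.futures.ProcessPoolExecutor`. Any speed-up obtained this way depends on the dataset size and the number of cores available.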