Namasudra Suyel, Dhamodharavadhani S, Rathipriya R, Crespo Ruben Gonzalez, Moparthi Nageswara Rao
Department of Computer Science and Engineering, National Institute of Technology Agartala, Tripura, India.
Department of Computer Science, Periyar University, Salem, India.
Big Data. 2024 Apr;12(2):83-99. doi: 10.1089/big.2022.0155. Epub 2023 Feb 24.
Big data combines large volumes of structured, semistructured, and unstructured data collected from various sources, and it must be processed before it can be used in analytical applications. Anomalies or inconsistencies in big data are observations that are in some way unusual and do not fit the general pattern; they are considered one of the major problems of big data. The data trust method (DTM) is a technique that identifies anomalous or untrustworthy data and replaces them using interpolation. This article discusses the DTM as a preprocessing approach for univariate time series (UTS) forecasting on big data with a neural network (NN) model. In this work, the DTM combines a statistics-based untrustworthy-data detection method with a statistics-based untrustworthy-data replacement method, and it is used to improve the forecast quality of UTS. An enhanced NN model for big data is proposed that incorporates DTMs into the NN-based UTS forecasting model. The coefficient of variation of the root mean squared error (CVRMSE) is used as the main indicator for choosing the best UTS data for model development. The results show that the proposed method is effective: it improves the prediction process by detecting and replacing untrustworthy data.
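The detect-then-replace idea described in the abstract can be illustrated with a minimal sketch. The abstract specifies only that detection is statistics-based and that replacement uses interpolation, so everything concrete here is an assumption made for illustration: the modified z-score detector (median and median absolute deviation), the 3.5 threshold, linear interpolation via `numpy.interp`, and the function names `detect_untrustworthy`, `replace_untrustworthy`, and `cv_rmse`. The CVRMSE is computed with its common definition, RMSE divided by the mean of the observed values, which may differ from the exact form used in the article.

```python
import numpy as np


def detect_untrustworthy(series, threshold=3.5):
    """Return a boolean mask of points flagged as untrustworthy.

    Assumption: a modified z-score based on the median and the median
    absolute deviation (MAD). The article's actual detector may differ.
    """
    x = np.asarray(series, dtype=float)
    median = np.median(x)
    mad = np.median(np.abs(x - median))
    if mad == 0:
        # No spread around the median: nothing can be flagged reliably.
        return np.zeros(x.shape, dtype=bool)
    modified_z = 0.6745 * (x - median) / mad
    return np.abs(modified_z) > threshold


def replace_untrustworthy(series, mask):
    """Replace flagged points by linear interpolation over trusted neighbours."""
    x = np.asarray(series, dtype=float)
    idx = np.arange(x.size)
    trusted = ~np.asarray(mask, dtype=bool)
    cleaned = x.copy()
    # np.interp evaluates the piecewise-linear curve through trusted points.
    cleaned[~trusted] = np.interp(idx[~trusted], idx[trusted], x[trusted])
    return cleaned


def cv_rmse(actual, predicted):
    """Coefficient of variation of the RMSE: RMSE / mean(actual)."""
    a = np.asarray(actual, dtype=float)
    p = np.asarray(predicted, dtype=float)
    rmse = np.sqrt(np.mean((a - p) ** 2))
    return rmse / np.mean(a)


# Hypothetical toy series with one obvious anomaly at index 4.
raw = [10, 11, 12, 13, 100, 15, 16, 17, 18, 19]
mask = detect_untrustworthy(raw)
cleaned = replace_untrustworthy(raw, mask)
```

In this toy series only the value 100 is flagged, and it is replaced by the midpoint of its two trusted neighbours. A forecasting model would then be trained on `cleaned`, and candidate series or models could be compared by their `cv_rmse` on held-out data, with lower values indicating a better fit relative to the scale of the series.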