Carnegie Institution for Science, Stanford, United States.
University of California, Irvine, Irvine, United States.
Sci Data. 2020 May 26;7(1):155. doi: 10.1038/s41597-020-0483-x.
Electricity usage (demand) data are used by utilities, governments, and academics to model electric grids for a variety of planning purposes (e.g., capacity expansion and system operation). The U.S. Energy Information Administration collects hourly demand data from all balancing authorities (BAs) in the contiguous United States. As of September 2019, we find that 2.2% of the demand data in its database are missing. Additionally, 0.5% of reported quantities are either negative or otherwise identified as outliers. With the goal of attaining non-missing, continuous, and physically plausible demand data to facilitate analysis, we developed a screening process to identify anomalous values. We then applied a Multiple Imputation by Chained Equations (MICE) technique to impute replacements for missing and anomalous values. We cross-validated the MICE technique by marking subsets of plausible data as missing and using the remaining data to predict these "missing" data. The mean absolute percentage error of imputed values is 3.5% across all BAs. The cleaned data are published and available open access: https://doi.org/10.5281/zenodo.3690240.
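The cross-validation procedure described in the abstract can be illustrated with a minimal Python sketch. It rests on several assumptions that are not taken from the paper: the function names (`screen`, `impute_linear`, `mape`, `cross_validate`), the 10% hold-out fraction, and the random seed are hypothetical; the screening step flags only negative values, whereas the authors' screening also identifies other outliers; and simple linear interpolation stands in for MICE purely to keep the example self-contained. Only the overall pattern (hide plausible values, impute them, score with mean absolute percentage error) follows the abstract.

```python
import random


def screen(demand):
    """Mark negative values as missing (None).

    Simplified stand-in for the paper's screening, which also flags outliers.
    """
    return [None if (v is None or v < 0) else v for v in demand]


def impute_linear(series):
    """Fill None entries by linear interpolation between known neighbors.

    This is NOT MICE; it is a placeholder imputer for illustration only.
    Gaps at either end are filled with the nearest known value.
    """
    known = [i for i, v in enumerate(series) if v is not None]
    if not known:
        raise ValueError("series has no observed values")
    out = list(series)
    for i, v in enumerate(series):
        if v is not None:
            continue
        left = max((k for k in known if k < i), default=None)
        right = min((k for k in known if k > i), default=None)
        if left is None:
            out[i] = series[right]
        elif right is None:
            out[i] = series[left]
        else:
            weight = (i - left) / (right - left)
            out[i] = series[left] + weight * (series[right] - series[left])
    return out


def mape(actual, predicted):
    """Mean absolute percentage error in percent; assumes nonzero actuals."""
    errors = [abs(p - a) / abs(a) for a, p in zip(actual, predicted)]
    return 100.0 * sum(errors) / len(errors)


def cross_validate(demand, holdout_fraction=0.1, seed=0):
    """Hide a random subset of plausible values, impute them, return MAPE."""
    screened = screen(demand)
    plausible = [i for i, v in enumerate(screened) if v is not None]
    rng = random.Random(seed)
    n_holdout = max(1, int(len(plausible) * holdout_fraction))
    held_out = rng.sample(plausible, n_holdout)

    masked = list(screened)
    for i in held_out:
        masked[i] = None  # pretend these plausible values are missing

    imputed = impute_linear(masked)
    return mape([screened[i] for i in held_out],
                [imputed[i] for i in held_out])


if __name__ == "__main__":
    # Synthetic hourly demand with one negative and one missing report.
    hourly = [100.0 + 10.0 * (h % 24) for h in range(72)]
    hourly[5] = -1.0
    hourly[30] = None
    print(f"hold-out MAPE: {cross_validate(hourly):.2f}%")
```

In practice the placeholder imputer would be replaced by a multiple-imputation routine, and the hold-out would be repeated over several random subsets and over every BA before averaging the error.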