Are all data useful? Inferring causality to predict flows across sewer and drainage systems using directed information and boosted regression trees.

Author information

Department of Civil and Environmental Engineering, University of Michigan, Ann Arbor, United States.

School for Environment and Sustainability, University of Michigan, Ann Arbor, United States.

Publication information

Water Res. 2018 Nov 15;145:697-706. doi: 10.1016/j.watres.2018.09.009. Epub 2018 Sep 4.

DOI:10.1016/j.watres.2018.09.009

PMID:30216864

Abstract

As more sensor data become available across urban water systems, it is often unclear which of these new measurements are actually useful and how they can be efficiently ingested to improve predictions. We present a data-driven approach for modeling and predicting flows across combined sewer and drainage systems, which fuses sensor measurements with output of a large numerical simulation model. Rather than adjusting the structure and parameters of the numerical model, as is commonly done when new data become available, our approach instead learns causal relationships between the numerically-modeled outputs, distributed rainfall measurements, and measured flows. By treating an existing numerical model - even one that may be outdated - as just another data stream, we illustrate how to automatically select and combine features that best explain flows for any given location. This allows for new sensor measurements to be rapidly fused with existing knowledge of the system without requiring recalibration of the underlying physics. Our approach, based on Directed Information (DI) and Boosted Regression Trees (BRT), is evaluated by fusing measurements across nearly 30 rain gages, 15 flow locations, and the outputs of a numerical sewer model in the city of Detroit, Michigan: one of the largest combined sewer systems in the world. The results illustrate that the Boosted Regression Trees provide skillful predictions of flow, especially when compared to an existing numerical model. The innovation of this paper is the use of the Directed Information step, which selects only those inputs that are causal with measurements at locations of interest. Better predictions are achieved when the Directed Information step is used because it reduces overfitting during the training phase of the predictive algorithm. 
In the age of "big water data", this finding highlights the importance of screening all available data sources before using them as inputs to data-driven models, since more may not always be better. We discuss the generalizability of the case study and the requirements of transferring the approach to other systems.
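
The screening idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' Directed Information estimator: it assumes a much simpler one-step, transfer-entropy-style quantity, I(Y_t ; X_{t-1} | Y_{t-1}), estimated by counting on equal-width bins. The function names (`discretize`, `lagged_information`), the synthetic series, the gauge names, and the 0.1-bit threshold are all hypothetical and chosen only for illustration.

```python
import math
import random
from collections import Counter


def discretize(series, n_bins=4):
    """Map real values to integer labels using equal-width bins."""
    lo, hi = min(series), max(series)
    width = (hi - lo) / n_bins or 1.0  # guard against a constant series
    return [min(int((v - lo) / width), n_bins - 1) for v in series]


def lagged_information(x, y):
    """Plug-in estimate, in bits, of I(Y_t ; X_{t-1} | Y_{t-1}).

    Assumption: a one-step simplification of directed information,
    not the estimator used in the paper.
    """
    y_now, y_past, x_past = y[1:], y[:-1], x[:-1]
    n = len(y_now)
    c_all = Counter(zip(y_now, y_past, x_past))
    c_yy = Counter(zip(y_now, y_past))
    c_yx = Counter(zip(y_past, x_past))
    c_y = Counter(y_past)
    total = 0.0
    for (a, b, c), count in c_all.items():
        p_with_x = count / c_yx[(b, c)]      # p(y_t | y_{t-1}, x_{t-1})
        p_without_x = c_yy[(a, b)] / c_y[b]  # p(y_t | y_{t-1})
        total += (count / n) * math.log2(p_with_x / p_without_x)
    return total


# Synthetic data (hypothetical): flow responds to gauge A with a one-step
# delay; gauge B is unrelated noise.
random.seed(0)
n = 2000
rain_a = [random.random() for _ in range(n)]
rain_b = [random.random() for _ in range(n)]
flow = [random.gauss(0, 0.05)] + [
    0.8 * rain_a[t - 1] + random.gauss(0, 0.05) for t in range(1, n)
]

candidates = {"rain_gauge_A": rain_a, "rain_gauge_B": rain_b}
flow_d = discretize(flow)
scores = {
    name: lagged_information(discretize(series), flow_d)
    for name, series in candidates.items()
}

THRESHOLD_BITS = 0.1  # illustrative cutoff, not taken from the paper
selected = [name for name, score in scores.items() if score > THRESHOLD_BITS]
print(scores)
print(selected)
```

In a full pipeline, only the inputs in `selected` would then be passed to a boosted regression tree model (for example scikit-learn's `GradientBoostingRegressor`), which is the step the abstract credits with reduced overfitting. That step is omitted here to keep the sketch dependency-free.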
