Interlandi Matteo, Ekmekji Ari, Shah Kshitij, Gulzar Muhammad Ali, Tetali Sai Deep, Kim Miryung, Millstein Todd, Condie Tyson
Microsoft, Redmond, WA, USA.
Stanford University, Stanford, CA, USA.
VLDB J. 2018 Oct;27(5):595-615. doi: 10.1007/s00778-017-0474-5. Epub 2017 Aug 7.
Debugging data processing logic in data-intensive scalable computing (DISC) systems is a difficult and time-consuming effort. Today's DISC systems offer very little tooling for debugging programs, and as a result, programmers spend countless hours collecting evidence (e.g., from log files) and performing trial-and-error debugging. To aid this effort, we built Titian, a library that enables data provenance (tracking data through transformations) in Apache Spark. Data scientists using the Titian Spark extension will be able to quickly identify the input data at the root cause of a potential bug or outlier result. Titian is built directly into the Spark platform and offers data provenance support at interactive speeds, orders of magnitude faster than alternative solutions, while minimally impacting Spark job performance; observed overheads for capturing data lineage rarely exceed 30% above the baseline job execution time.
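To make the idea of tracking data through transformations concrete, the following is a minimal sketch in plain Python. It is not Titian's API and does not use Spark; every function name here (`with_lineage`, `traced_map`, `traced_filter`, `traced_reduce_by_key`, `trace_back`) and the sample input are hypothetical, chosen only to illustrate how each output can carry the set of input positions it was derived from.

```python
# Illustrative only: a toy lineage tracker, not Titian's implementation.

def with_lineage(records):
    # Tag each input record with the set of input positions it derives from.
    return [(value, frozenset([i])) for i, value in enumerate(records)]

def traced_map(fn, tagged):
    # A map keeps the lineage of the record it transforms.
    return [(fn(value), sources) for value, sources in tagged]

def traced_filter(pred, tagged):
    # A filter drops records; survivors keep their lineage unchanged.
    return [(value, sources) for value, sources in tagged if pred(value)]

def traced_reduce_by_key(fn, tagged):
    # Records are (key, value) pairs. The lineage of each aggregate is
    # the union of the lineages of all records that contributed to it.
    acc = {}
    for (key, value), sources in tagged:
        if key in acc:
            prev_value, prev_sources = acc[key]
            acc[key] = (fn(prev_value, value), prev_sources | sources)
        else:
            acc[key] = (value, sources)
    return acc

def trace_back(inputs, sources):
    # Resolve a lineage set to the original input records.
    return [inputs[i] for i in sorted(sources)]

# Hypothetical input: "<status> <count>" lines, one of them malformed.
lines = ["ok 12", "ok 7", "err 3", "ok bad", "err 900"]

tagged = with_lineage(lines)
parsed = traced_map(lambda s: s.split(), tagged)
valid = traced_filter(lambda p: p[1].isdigit(), parsed)
pairs = traced_map(lambda p: (p[0], int(p[1])), valid)
totals = traced_reduce_by_key(lambda a, b: a + b, pairs)

# Suppose the "err" total looks like an outlier: trace it to its inputs.
err_total, err_sources = totals["err"]
culprits = trace_back(lines, err_sources)
```

Here `culprits` contains only the input lines that fed the suspicious aggregate, which is the kind of backward trace the abstract describes. A real system would have to capture this information across distributed partitions and stages, which this sketch does not attempt.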