Rish Irina, Brodie Mark, Ma Sheng, Odintsova Natalia, Beygelzimer Alina, Grabarnik Genady, Hernandez Karina
IBM T.J. Watson Research Center, Hawthorne, NY 10532, USA.
IEEE Trans Neural Netw. 2005 Sep;16(5):1088-109. doi: 10.1109/TNN.2005.853423.
Real-time problem diagnosis in large distributed computer systems and networks is a challenging task that requires fast and accurate inferences from potentially huge data volumes. In this paper, we propose a cost-efficient, adaptive diagnostic technique called active probing. Probes are end-to-end test transactions that collect information about the performance of a distributed system. Active probing combines probabilistic reasoning techniques with an information-theoretic approach, and enables fast online inference about the current system state by actively selecting only a small number of the most informative tests. We demonstrate empirically that, compared with nonadaptive (preplanned) probing schemes, the active probing scheme greatly reduces both the number of probes (by 60% to 75% in most of our real-life applications) and the time needed to localize the problem. We also provide some theoretical results on the complexity of probe selection and on the effect of "noisy" probes on the accuracy of diagnosis. Finally, we discuss how to model the system's dynamics using dynamic Bayesian networks (DBNs), and we describe an efficient approximate approach called sequential multifault; empirical results demonstrate a clear advantage of such approaches over "static" techniques that do not handle changes in the system.
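The idea of choosing the most informative probe at each step can be illustrated with a minimal sketch. It is not the paper's implementation: it assumes at most one faulty node, noise-free probes (a probe fails exactly when the faulty node lies on its path), and a uniform prior, whereas the abstract describes Bayesian-network inference that also covers noisy probes and changing systems. All names here (`PROBES`, `active_diagnosis`, `run_probe`, the node labels) are hypothetical and chosen for illustration only.

```python
import math


def entropy(belief):
    """Shannon entropy (in bits) of a distribution given as {state: probability}."""
    return -sum(p * math.log2(p) for p in belief.values() if p > 0)


def update(belief, probe_nodes, failed):
    """Keep only the states consistent with the observed probe outcome, then renormalize."""
    consistent = {s: p for s, p in belief.items() if (s in probe_nodes) == failed}
    z = sum(consistent.values())
    if z == 0:
        raise ValueError("probe outcome is inconsistent with every remaining state")
    return {s: p / z for s, p in consistent.items()}


def expected_entropy(belief, probe_nodes):
    """Expected posterior entropy after running a probe (assumed noise-free)."""
    p_fail = sum(p for s, p in belief.items() if s in probe_nodes)
    total = 0.0
    for outcome_prob, failed in ((p_fail, True), (1.0 - p_fail, False)):
        if outcome_prob > 0:
            total += outcome_prob * entropy(update(belief, probe_nodes, failed))
    return total


def active_diagnosis(probes, prior, run_probe):
    """Greedily run the probe with the largest expected information gain."""
    belief = dict(prior)
    remaining = dict(probes)
    used = []
    while remaining and entropy(belief) > 1e-9:
        best = min(remaining, key=lambda name: expected_entropy(belief, remaining[name]))
        gain = entropy(belief) - expected_entropy(belief, remaining[best])
        if gain <= 1e-12:
            break  # no remaining probe can reduce the uncertainty any further
        failed = run_probe(best)
        belief = update(belief, remaining.pop(best), failed)
        used.append(best)
    return belief, used


# Hypothetical toy system: each probe is the set of nodes its path traverses.
PROBES = {
    "p1": {"A", "B"},
    "p2": {"B", "C"},
    "p3": {"C", "D"},
    "p4": {"A"},
    "p5": {"A", "C"},
}
STATES = ["ok", "A", "B", "C", "D"]  # "ok" means that no node is faulty
PRIOR = {s: 1.0 / len(STATES) for s in STATES}

TRUE_FAULT = "C"  # simulated ground truth, unknown to the diagnosis loop
belief, used = active_diagnosis(PROBES, PRIOR, lambda name: TRUE_FAULT in PROBES[name])
print("most likely state:", max(belief, key=belief.get))
print("probes used:", used, "out of", len(PROBES))
```

A preplanned scheme would run every probe in `PROBES` regardless of the outcomes; the greedy loop stops as soon as the belief collapses onto a single state, which is how adaptive selection saves probes in this toy setting.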