IEEE Trans Cybern. 2015 Dec;45(12):2890-904. doi: 10.1109/TCYB.2015.2388791. Epub 2015 Jan 22.
The Bayesian network (BN) has been widely adopted as the underlying model for representing and inferring uncertain knowledge. As the basis of realistic applications centered on probabilistic inference, learning a BN from data is a critical subject in machine learning, artificial intelligence, and the big data paradigm. It is now necessary to extend the classical methods for learning BNs to data-intensive computing and cloud environments. In this paper, we propose a parallel and incremental approach for data-intensive learning of BNs from massive, distributed, and dynamically changing data by extending the classical scoring-and-search algorithm with MapReduce. First, we adopt the minimum description length (MDL) as the scoring metric and give two-pass MapReduce-based algorithms for computing the required marginal probabilities and scoring a candidate graphical model from sample data. Then, we give the corresponding strategy for extending the classical hill-climbing algorithm to obtain the optimal structure, as well as a scheme for storing a BN as <key, value> pairs. Further, in view of the dynamic nature of the changing data, we introduce the concept of influence degree to measure how well the current BN coincides with newly arriving data, and then propose the corresponding two-pass MapReduce-based algorithms for incremental learning of BNs. Experimental results show the efficiency, scalability, and effectiveness of our methods.
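The two-pass idea, family counts in the first pass and an MDL score in the second, can be illustrated with a minimal single-machine sketch. This is not the authors' Hadoop implementation: the helper names (map_counts, reduce_counts, mdl_score), the <key, value> layout, and the particular MDL penalty 0.5 * K * log N are assumptions standing in for the algorithms the abstract only outlines.

from collections import Counter, defaultdict
from math import log

def map_counts(record, structure):
    # Pass 1 mapper: for each node, emit <(node, parent_config, value), 1>
    for node, parents in structure.items():
        parent_cfg = tuple(record[p] for p in parents)
        yield (node, parent_cfg, record[node]), 1

def reduce_counts(pairs):
    # Reducer: sum the occurrence counts per key
    counts = Counter()
    for key, value in pairs:
        counts[key] += value
    return counts

def mdl_score(counts, structure, cardinalities, n_samples):
    # Pass 2: turn counts into MDL = -log-likelihood + parameter penalty
    family = defaultdict(Counter)            # (node, parent_cfg) -> value counts
    for (node, cfg, value), c in counts.items():
        family[(node, cfg)][value] += c
    log_lik = 0.0
    for (node, cfg), value_counts in family.items():
        n_ij = sum(value_counts.values())
        for n_ijk in value_counts.values():
            log_lik += n_ijk * log(n_ijk / n_ij)
    # free parameters: sum over nodes of (r_i - 1) * q_i
    n_params = 0
    for node, parents in structure.items():
        q_i = 1
        for p in parents:
            q_i *= cardinalities[p]
        n_params += (cardinalities[node] - 1) * q_i
    return -log_lik + 0.5 * n_params * log(n_samples)

# Usage on a toy sample with candidate structure A -> B:
data = [{"A": 0, "B": 0}, {"A": 0, "B": 1}, {"A": 1, "B": 1}, {"A": 1, "B": 1}]
structure = {"A": [], "B": ["A"]}
cards = {"A": 2, "B": 2}
pairs = [kv for rec in data for kv in map_counts(rec, structure)]
print(mdl_score(reduce_counts(pairs), structure, cards, len(data)))

In the distributed setting described in the paper, the mapper output and reducer aggregation would be carried out by MapReduce over data partitions; a hill-climbing search would then call such a scorer for each candidate edge change.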
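The abstract does not give the formal definition of the influence degree, so the sketch below is a rough illustration only: it treats the influence degree as the fraction of new records that the current BN explains poorly, and record_log_prob, influence_degree, the cpts layout, and the threshold are all hypothetical stand-ins rather than the paper's construction.

from math import log

def record_log_prob(record, structure, cpts):
    # Log-probability of one record under the current BN (chain rule over families);
    # cpts: node -> parent_config tuple -> value -> probability (assumed layout)
    lp = 0.0
    for node, parents in structure.items():
        cfg = tuple(record[p] for p in parents)
        lp += log(cpts[node][cfg][record[node]])
    return lp

def influence_degree(new_records, structure, cpts, threshold):
    # Hypothetical surrogate: share of new records whose likelihood under the
    # current BN falls below a chosen threshold
    if not new_records:
        return 0.0
    poorly_explained = sum(
        1 for r in new_records
        if record_log_prob(r, structure, cpts) < threshold
    )
    return poorly_explained / len(new_records)

# If the influence degree exceeds a trigger value, the incremental two-pass
# MapReduce step would re-aggregate the affected counts from the new data and
# revise the structure, rather than relearning the BN from scratch.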