Shi Chengchun, Lu Wenbin, Song Rui
Department of Statistics, North Carolina State University, Raleigh, NC 27695, USA.
J Mach Learn Res. 2019;20.
Statistical relational learning is primarily concerned with learning and inferring relationships between entities in large-scale knowledge graphs. Nickel et al. (2011) proposed the RESCAL tensor factorization model for statistical relational learning, which achieves better or at least comparable results on common benchmark data sets when compared to other state-of-the-art methods. Given a positive integer specifying the number of latent factors, RESCAL computes a latent vector of that dimension for each entity. The latent factors can then be used to solve relational learning tasks such as collective classification, collective entity resolution and link-based clustering. The focus of this paper is to determine the number of latent factors in the RESCAL model. Due to the structure of the RESCAL model, its log-likelihood function is not concave. As a result, the corresponding maximum likelihood estimators (MLEs) may not be consistent. Nonetheless, we design a specific pseudometric, prove the consistency of the MLEs under this pseudometric and establish their rate of convergence. Based on these results, we propose a general class of information criteria and prove their model selection consistency when the number of relations is either bounded or diverges at a suitable rate relative to the number of entities. Simulations and real data examples show that the proposed information criteria have good finite sample properties.
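For readers unfamiliar with RESCAL, the following is a minimal sketch of how such a model scores a multi-relational binary tensor. It is not the authors' implementation. It assumes a logistic link, which the abstract does not state, and all names and sizes (`rescal_logits`, `bernoulli_log_likelihood`, `num_entities`, `num_relations`, `num_factors`) are hypothetical. Each entity gets one latent vector (a row of `A`), each relation gets one square matrix (a slice of `R`), and the score of a triple is the bilinear form of the two entity vectors with the relation matrix. The last lines illustrate that the latent factors are not unique: an invertible change of basis leaves every score unchanged.

```python
import numpy as np


def rescal_logits(A, R):
    """Return scores a_i^T R_k a_j for all relations k and entity pairs (i, j).

    A: (num_entities, num_factors) array, one latent vector per entity.
    R: (num_relations, num_factors, num_factors) array, one matrix per relation.
    Output shape: (num_relations, num_entities, num_entities).
    """
    return np.einsum("ia,kab,jb->kij", A, R, A)


def bernoulli_log_likelihood(Y, logits):
    """Log-likelihood of a binary tensor Y when P(Y = 1) = sigmoid(logits)."""
    # y * theta - log(1 + exp(theta)); logaddexp keeps this numerically stable.
    return float(np.sum(Y * logits - np.logaddexp(0.0, logits)))


# Hypothetical toy sizes, chosen only for illustration.
rng = np.random.default_rng(0)
num_entities, num_relations, num_factors = 6, 2, 3

A = rng.normal(size=(num_entities, num_factors))
R = rng.normal(size=(num_relations, num_factors, num_factors))
theta = rescal_logits(A, R)

# Simulate a binary relation tensor from the assumed logistic model.
Y = (rng.random(size=theta.shape) < 1.0 / (1.0 + np.exp(-theta))).astype(float)
loglik = bernoulli_log_likelihood(Y, theta)

# Change of basis: (A Q) and Q^{-1} R_k Q^{-T} give the same scores as A and R_k.
Q = np.eye(num_factors) + 0.1 * rng.normal(size=(num_factors, num_factors))
Q_inv = np.linalg.inv(Q)
theta_rotated = rescal_logits(A @ Q, Q_inv @ R @ Q_inv.T)
same_scores = np.allclose(theta, theta_rotated)
```

Under these assumptions, fitting the model for several candidate values of `num_factors` and comparing penalized log-likelihoods would be the natural way to apply an information criterion; the specific penalty used in the paper is not given in the abstract and is therefore not sketched here.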