Department of Computer Science, University of Colorado, Boulder, CO 80309;
Information Sciences Institute, University of Southern California, Marina del Rey, CA 90292.
Proc Natl Acad Sci U S A. 2020 Sep 22;117(38):23393-23400. doi: 10.1073/pnas.1914950117. Epub 2020 Sep 4.
Most real-world networks are incompletely observed. Algorithms that can accurately predict which links are missing can dramatically speed up network data collection and improve network model validation. Many algorithms now exist for predicting missing links, given a partially observed network, but it has remained unknown whether a single best predictor exists, how link predictability varies across methods and networks from different domains, and how close to optimality current methods are. We answer these questions by systematically evaluating 203 individual link predictor algorithms, representing three popular families of methods, applied to a large corpus of 550 structurally diverse networks from six scientific domains. We first show that individual algorithms exhibit a broad diversity of prediction errors, such that no one predictor or family is best, or worst, across all realistic inputs. We then exploit this diversity using network-based metalearning to construct a series of "stacked" models that combine predictors into a single algorithm. Applied to a broad range of synthetic networks, for which we may analytically calculate optimal performance, these stacked models achieve optimal or nearly optimal levels of accuracy. Applied to real-world networks, stacked models are superior, but their accuracy varies strongly by domain, suggesting that link prediction may be fundamentally easier in social networks than in biological or technological networks. These results indicate that the state of the art for link prediction comes from combining individual algorithms, which can achieve nearly optimal predictions. We close with a brief discussion of limitations and opportunities for further improvements.
大多数真实世界的网络都是不完全观测到的。能够准确预测哪些链接缺失的算法可以显著加快网络数据收集和改进网络模型验证。现在存在许多用于预测缺失链接的算法,给定一个部分观测到的网络,但仍然不知道是否存在单个最佳预测器,不同方法和来自不同领域的网络的链接可预测性如何变化,以及当前方法接近最优的程度。我们通过系统地评估 203 个单独的链接预测算法,代表三种流行的方法家族,应用于来自六个科学领域的 550 个结构多样的网络的大型语料库,回答了这些问题。我们首先表明,个别算法表现出广泛的预测误差多样性,因此在所有现实输入中,没有一个预测器或方法家族是最好的或最差的。然后,我们利用基于网络的元学习利用这种多样性,构建一系列将预测器组合成单个算法的“堆叠”模型。将这些堆叠模型应用于广泛的合成网络,对于这些网络,我们可以通过分析计算出最佳性能,这些堆叠模型可以达到最佳或几乎最佳的准确性水平。将这些堆叠模型应用于真实网络,它们的性能优于传统模型,但准确性因领域而异,这表明在社交网络中,链接预测可能比在生物或技术网络中更基本。这些结果表明,链接预测的最新技术来自于结合单个算法,这些算法可以实现几乎最佳的预测。最后,我们简要讨论了进一步改进的限制和机会。