Filippo Menczer
School of Informatics, Indiana University, Bloomington, IN 47408, USA.
Proc Natl Acad Sci U S A. 2004 Apr 6;101 Suppl 1(Suppl 1):5261-5. doi: 10.1073/pnas.0307554100. Epub 2004 Jan 27.
How does a network of documents grow without centralized control? This question is becoming crucial as we try to explain the emergent scale-free topology of the World Wide Web and use link analysis to identify important information resources. Existing models of growing information networks have focused on the structure of links but neglected the content of nodes. Here I show that the current models fail to reproduce a critical characteristic of information networks, namely the distribution of textual similarity among linked documents. I propose a more realistic model that generates links by using both popularity and content. This model yields remarkably accurate predictions of both degree and similarity distributions in networks of web pages and scientific literature.
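The idea of generating links from both popularity and content can be illustrated with a small simulation. The sketch below is not the model from the paper, whose exact form the abstract does not give. It assumes a simple mixture: with probability `alpha` a new document links to an earlier one in proportion to its in-degree (popularity), and otherwise in proportion to bag-of-words cosine similarity (content). The names `grow_network`, `cosine_similarity`, `alpha`, and `links_per_node` are hypothetical and chosen only for illustration.

```python
import math
import random
from collections import Counter


def cosine_similarity(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def grow_network(docs, links_per_node=2, alpha=0.5, seed=0):
    """Add documents one at a time; each new one links to earlier documents.

    Assumed mixture rule (illustrative only, not the paper's formula):
    with probability alpha, pick a target in proportion to in-degree + 1;
    otherwise pick in proportion to cosine similarity with the new document.
    """
    rng = random.Random(seed)
    bags = [Counter(d.lower().split()) for d in docs]
    in_degree = [0] * len(docs)
    edges = []
    for new in range(1, len(docs)):
        targets = set()
        n_links = min(links_per_node, new)  # cannot exceed the earlier docs
        while len(targets) < n_links:
            remaining = [c for c in range(new) if c not in targets]
            if rng.random() < alpha:
                # Popularity: +1 so documents with no links can still be chosen.
                weights = [in_degree[c] + 1 for c in remaining]
            else:
                # Content: small offset keeps the total weight positive.
                weights = [cosine_similarity(bags[new], bags[c]) + 1e-9
                           for c in remaining]
            targets.add(rng.choices(remaining, weights=weights, k=1)[0])
        for t in sorted(targets):
            edges.append((new, t))
            in_degree[t] += 1
    return edges, in_degree
```

Setting `alpha=1.0` reduces this sketch to a purely popularity-driven process, while `alpha=0.0` makes links depend on content alone. Comparing the degree and linked-pair similarity distributions across values of `alpha` is one way to explore the trade-off the abstract describes.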