Self organization of a massive document collection.

Authors

Kohonen T, Kaski S, Lagus K, Salojarvi J, Honkela J, Paatero V, Saarela A

Affiliation

Neural Networks Research Centre, Helsinki University of Technology, Espoo, Finland.

Publication

IEEE Trans Neural Netw. 2000;11(3):574-85. doi: 10.1109/72.846729.

PMID:18249786

Abstract

This article describes the implementation of a system that can organize vast document collections according to textual similarities. It is based on the self-organizing map (SOM) algorithm. Statistical representations of the documents' vocabularies serve as their feature vectors. The main goal of our work has been to scale up the SOM algorithm so that it can handle large amounts of high-dimensional data. In a practical experiment we mapped 6,840,568 patent abstracts onto a 1,002,240-node SOM. As feature vectors we used 500-dimensional vectors of stochastic figures obtained as random projections of weighted word histograms.
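The pipeline named in the abstract (weighted word histogram, random projection to 500 dimensions, mapping onto a SOM node) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the word weighting, the construction of the projection matrix, or the SOM training and speed-up methods. The sketch assumes uniform placeholder weights, Gaussian random word vectors normalized to unit length, squared Euclidean distance for the node search, and an untrained toy codebook of 12 nodes. All function names and the three sample texts are hypothetical.

```python
import math
import random


def weighted_histogram(tokens, vocab_index, weights):
    """Accumulate per-word weights into a histogram over the vocabulary."""
    hist = [0.0] * len(vocab_index)
    for tok in tokens:
        col = vocab_index.get(tok)
        if col is not None:  # ignore out-of-vocabulary words
            hist[col] += weights[col]
    return hist


def make_projection(vocab_size, out_dim, rng):
    """One random unit-length vector per vocabulary word (assumed Gaussian)."""
    rows = []
    for _ in range(vocab_size):
        v = [rng.gauss(0.0, 1.0) for _ in range(out_dim)]
        norm = math.sqrt(sum(x * x for x in v))
        rows.append([x / norm for x in v])
    return rows


def project(hist, projection):
    """Random projection: weighted sum of the random word vectors."""
    out = [0.0] * len(projection[0])
    for weight, word_vec in zip(hist, projection):
        if weight:
            for j, component in enumerate(word_vec):
                out[j] += weight * component
    return out


def best_matching_unit(x, codebook):
    """Index of the codebook vector closest to x (squared Euclidean distance)."""
    def sqdist(m):
        return sum((a - b) ** 2 for a, b in zip(x, m))
    return min(range(len(codebook)), key=lambda i: sqdist(codebook[i]))


# Toy data: three invented abstracts, not taken from the patent collection.
docs = [
    "rotor blade for wind turbine",
    "wind turbine blade assembly",
    "antibody binding protein assay",
]
tokenized = [d.split() for d in docs]
vocab = sorted({t for toks in tokenized for t in toks})
vocab_index = {w: i for i, w in enumerate(vocab)}
weights = [1.0] * len(vocab)  # placeholder weights (assumption)

rng = random.Random(0)
OUT_DIM = 500  # feature dimension reported in the abstract
projection = make_projection(len(vocab), OUT_DIM, rng)
features = [
    project(weighted_histogram(toks, vocab_index, weights), projection)
    for toks in tokenized
]

# Untrained toy codebook standing in for a trained SOM (assumption).
codebook = [[rng.gauss(0.0, 1.0) for _ in range(OUT_DIM)] for _ in range(12)]
nodes = [best_matching_unit(x, codebook) for x in features]
print(nodes)
```

The cost of the projection step depends on the number of distinct words in a document and not on the full vocabulary size, which is one reason the technique suits very large collections. In a real system the codebook would come from SOM training instead of random initialization.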
