Landauer Thomas K, Laham Darrell, Derr Marcia
Department of Psychology, University of Colorado, Boulder, CO 80309-0345, USA.
Proc Natl Acad Sci U S A. 2004 Apr 6;101 Suppl 1(Suppl 1):5214-9. doi: 10.1073/pnas.0400341101. Epub 2004 Mar 22.
Most techniques for relating textual information rely on intellectually created links such as author-chosen keywords and titles, authority indexing terms, or bibliographic citations. Similarity of the semantic content of whole documents, rather than just titles, abstracts, or overlap of keywords, offers an attractive alternative. Latent semantic analysis provides an effective dimension reduction method for the purpose that reflects synonymy and the sense of arbitrary word combinations. However, latent semantic analysis correlations with human text-to-text similarity judgments are often empirically highest at approximately 300 dimensions. Thus, two- or three-dimensional visualizations are severely limited in what they can show, and the first and/or second automatically discovered principal component, or any three such for that matter, rarely capture all of the relations that might be of interest. It is our conjecture that linguistic meaning is intrinsically and irreducibly very high dimensional. Thus, some method to explore a high dimensional similarity space is needed. But the 2.7 × 10^7 projections and infinite rotations of, for example, a 300-dimensional pattern are impossible to examine. We suggest, however, that the use of a high dimensional dynamic viewer with an effective projection pursuit routine and user control, coupled with the exquisite abilities of the human visual system to extract information about objects and from moving patterns, can often succeed in discovering multiple revealing views that are missed by current computational algorithms. We show some examples of the use of latent semantic analysis to support such visualizations and offer views on future needs.
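To make the dimension-reduction step concrete, the following is a minimal sketch of latent semantic analysis as a truncated singular value decomposition of a term-by-document count matrix, followed by cosine similarity between documents. It is not the authors' implementation: the toy vocabulary, the counts, the function names, and the choice of k = 2 are all assumptions made for illustration (the abstract reports roughly 300 dimensions as typical for real corpora). The last line counts ordered triples of axes in a 300-dimensional space; that count is consistent with the 2.7 × 10^7 figure above, but reading the figure this way is also an assumption.

```python
import math
import numpy as np


def lsa_document_vectors(term_doc, k):
    """Return one k-dimensional vector per document via truncated SVD.

    term_doc is a (terms x documents) count matrix.
    """
    u, s, vt = np.linalg.svd(term_doc, full_matrices=False)
    # Keep the k largest singular values; scale document coordinates by them.
    return (vt[:k, :] * s[:k, None]).T


def cosine_similarity_matrix(vectors):
    """Pairwise cosine similarity between the rows of `vectors`."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / np.where(norms == 0, 1.0, norms)  # avoid division by zero
    return unit @ unit.T


# Hypothetical toy corpus: rows are terms, columns are four documents.
terms = ["car", "automobile", "engine", "flower", "petal"]
counts = np.array([
    [2, 0, 1, 0],   # car
    [0, 2, 1, 0],   # automobile
    [1, 1, 2, 0],   # engine
    [0, 0, 0, 2],   # flower
    [0, 0, 0, 1],   # petal
], dtype=float)

# Similarity from raw keyword overlap versus similarity in the latent space.
raw_similarity = cosine_similarity_matrix(counts.T)
doc_vectors = lsa_document_vectors(counts, k=2)
lsa_similarity = cosine_similarity_matrix(doc_vectors)

# Number of ordered triples of distinct axes among 300 dimensions.
axis_triples = math.perm(300, 3)
```

In this toy corpus, documents 0 and 1 use different words ("car" versus "automobile") and overlap only through "engine", so their raw cosine similarity is low, while in the two-dimensional latent space they land on the same axis. This is the synonymy effect the abstract describes, shown at a scale far smaller than any realistic corpus.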