Center for Cognitive Informatics and Decision Making, School of Health Information Sciences, University of Texas, Houston, TX, USA.
J Biomed Inform. 2010 Apr;43(2):240-56. doi: 10.1016/j.jbi.2009.09.003. Epub 2009 Sep 15.
The discovery of implicit connections between terms that do not occur together in any scientific document underlies the model of literature-based knowledge discovery first proposed by Swanson. Corpus-derived statistical models of semantic distance, such as Latent Semantic Analysis (LSA), have previously been evaluated as methods for discovering such implicit connections. However, LSA depends on a computationally demanding method of dimension reduction to obtain meaningful indirect inference, which limits its ability to scale to large text corpora. In this paper, we evaluate the ability of Random Indexing (RI), a scalable distributional model of word associations, to draw meaningful implicit relationships between terms in general and biomedical language. Proponents of RI have reported performance comparable to LSA on several cognitive tasks while using a simpler and less computationally demanding method of dimension reduction. We demonstrate that the original implementation of RI is ineffective at inferring meaningful indirect connections, and we evaluate Reflective Random Indexing (RRI), an iterative variant of the method that is better able to perform indirect inference. RRI is shown to yield more clearly related indirect connections and to outperform existing RI implementations in predicting future direct co-occurrence in the MEDLINE corpus.
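The contrast between RI and RRI described above can be illustrated with a minimal sketch in Python with NumPy. This is not the authors' implementation. It assumes a document-based variant in which each document receives a sparse random ternary index vector, a term vector is the sum of the index vectors of the documents containing the term, and one "reflective" pass rebuilds document vectors from term vectors and then term vectors from those document vectors. The function names, the toy corpus (loosely modeled on Swanson's fish oil and Raynaud example), the dimension of 2000, and the 20 nonzero seeds are all illustrative assumptions.

```python
import numpy as np

def index_vectors(n, dim, seeds, rng):
    """Sparse ternary random vectors: `seeds` entries are +1 or -1, the rest 0."""
    vecs = np.zeros((n, dim))
    for row in vecs:
        positions = rng.choice(dim, size=seeds, replace=False)
        row[positions] = rng.choice([-1.0, 1.0], size=seeds)
    return vecs

def normalize(matrix):
    """Scale each row to unit length, leaving all-zero rows untouched."""
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    return matrix / np.where(norms == 0, 1.0, norms)

def random_indexing(docs, vocab, dim=2000, seeds=20, seed=0):
    """Plain RI (assumed variant): term vector = sum of random document index vectors."""
    rng = np.random.default_rng(seed)
    doc_index = index_vectors(len(docs), dim, seeds, rng)
    term_vecs = np.zeros((len(vocab), dim))
    for d, doc in enumerate(docs):
        for term in set(doc):
            term_vecs[vocab[term]] += doc_index[d]
    return normalize(term_vecs)

def reflect(docs, vocab, term_vecs):
    """One reflective pass: terms -> document vectors -> new term vectors."""
    doc_vecs = np.zeros((len(docs), term_vecs.shape[1]))
    for d, doc in enumerate(docs):
        for term in set(doc):
            doc_vecs[d] += term_vecs[vocab[term]]
    doc_vecs = normalize(doc_vecs)
    new_terms = np.zeros_like(term_vecs)
    for d, doc in enumerate(docs):
        for term in set(doc):
            new_terms[vocab[term]] += doc_vecs[d]
    return normalize(new_terms)

def cosine(term_vecs, vocab, a, b):
    """Cosine similarity of two terms (rows are already unit length)."""
    return float(term_vecs[vocab[a]] @ term_vecs[vocab[b]])

# Hypothetical toy corpus: "fish_oil" and "raynaud" never share a document,
# but both co-occur with "blood_viscosity".
docs = [
    ["fish_oil", "blood_viscosity"],
    ["blood_viscosity", "raynaud"],
    ["aspirin", "headache"],
]
vocab = {t: i for i, t in enumerate(sorted({t for doc in docs for t in doc}))}

ri_terms = random_indexing(docs, vocab)
rri_terms = reflect(docs, vocab, ri_terms)

ri_sim = cosine(ri_terms, vocab, "fish_oil", "raynaud")
rri_sim = cosine(rri_terms, vocab, "fish_oil", "raynaud")
print("RI  similarity fish_oil ~ raynaud:", ri_sim)
print("RRI similarity fish_oil ~ raynaud:", rri_sim)
```

In this sketch, plain RI leaves the two terms nearly orthogonal because their vectors are built from different random document vectors, whereas the reflective pass lets the shared neighbor contribute to both, which is one way to picture the indirect inference the abstract refers to. Further reflective passes could be applied by calling `reflect` again on its own output.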