Percha Bethany, Altman Russ B
Stanford University, Stanford, CA.
AMIA Annu Symp Proc. 2013 Nov 16;2013:1123-32. eCollection 2013.
The biomedical literature presents a uniquely challenging text mining problem. Sentences are long and complex, the subject matter is highly specialized with a distinct vocabulary, and producing annotated training data for this domain is time-consuming and expensive. In this environment, unsupervised text mining methods that do not rely on annotated training data are valuable. Here we investigate the use of random indexing, an automated method for producing vector-space semantic representations of words from large, unlabeled corpora, to address the problem of term normalization in sentences describing drugs and genes. We show that random indexing produces similarity scores that capture some of the structure of PHARE, a manually curated ontology of pharmacogenomics concepts. We further show that random indexing can be used to identify likely word candidates for inclusion in the ontology, and can help localize these new labels among classes and roles within the ontology.
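For readers unfamiliar with random indexing, the general idea can be sketched as follows. This is a minimal illustration of the technique as it is commonly described, not the authors' implementation: each word receives a sparse random "index vector", and a word's semantic "context vector" is the sum of the index vectors of the words that occur near it. The dimensionality, sparsity, window size, function names, and the four toy sentences are all assumptions made for this example.

```python
import math
import random
from collections import defaultdict

DIM = 512      # dimensionality of the reduced space (assumed value)
NONZERO = 8    # number of +1/-1 entries per index vector (assumed value)
WINDOW = 2     # context words considered on each side (assumed value)


def make_index_vector(rng):
    """Sparse ternary index vector stored as {position: +1 or -1}."""
    positions = rng.sample(range(DIM), NONZERO)
    return {p: (1 if i % 2 == 0 else -1) for i, p in enumerate(positions)}


def train(sentences, seed=0):
    """Return {word: context vector}; each context vector is the sum of the
    index vectors of the words found within WINDOW positions of the word."""
    rng = random.Random(seed)
    index = {}
    context = defaultdict(lambda: [0.0] * DIM)
    for tokens in sentences:
        for tok in tokens:
            if tok not in index:
                index[tok] = make_index_vector(rng)
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - WINDOW), min(len(tokens), i + WINDOW + 1)
            for j in range(lo, hi):
                if j != i:
                    for pos, sign in index[tokens[j]].items():
                        context[word][pos] += sign
    return dict(context)


def cosine(u, v):
    """Cosine similarity between two dense vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Toy, hand-written sentences for illustration only (not data from the paper).
corpus = [
    "warfarin is metabolized by CYP2C9".split(),
    "phenytoin is metabolized by CYP2C9".split(),
    "codeine is metabolized by CYP2D6".split(),
    "VKORC1 variants affect warfarin dose".split(),
]
vectors = train(corpus)
similarity = cosine(vectors["phenytoin"], vectors["codeine"])
```

In this toy corpus "phenytoin" and "codeine" appear with identical neighbours inside the window, so their context vectors coincide. On a real corpus, similarity scores of this kind are what would be compared against the structure of an ontology such as PHARE.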