Crain Steven P, Yang Shuang-Hong, Zha Hongyuan, Jiao Yu
Georgia Institute of Technology, Atlanta, GA.
AMIA Annu Symp Proc. 2010 Nov 13;2010:132-6.
Consumers' access to health information is hampered by a fundamental language gap. Current attempts to close the gap rely on consumer-oriented health information, which, however, covers slang medical terminology poorly. In this paper, we present a Bayesian model that automatically aligns documents written in different dialects (slang, common, and technical) while extracting their semantic topics. The proposed dialect topic model (diaTM) enables effective information retrieval, even when the query contains slang words, by explicitly modeling the mixture of dialects in each document and the joint influence of dialect and topic on word selection. Simulations that use consumer questions to retrieve medical information from a corpus of medical documents show that diaTM improves retrieval relevance, measured by nDCG@5 (normalized discounted cumulative gain over the top five results), by 25% over a latent Dirichlet allocation (LDA) baseline.
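The reported gain is stated in terms of nDCG@5. As a reading aid, here is a minimal sketch of how that metric is commonly computed. The abstract does not say which gain variant the authors used, so the linear gain below is an assumption, as are the function names `dcg_at_k` and `ndcg_at_k` and the graded relevance labels in the usage line.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k results.

    Assumes linear gain: each relevance label is divided by log2(rank + 1),
    with ranks starting at 1.
    """
    return sum(
        rel / math.log2(rank + 1)
        for rank, rel in enumerate(relevances[:k], start=1)
    )


def ndcg_at_k(relevances, k=5):
    """DCG@k normalized by the DCG@k of the ideal (descending) ordering.

    `relevances` lists the graded relevance of each retrieved document in
    ranked order. The ideal ordering is built from the same labels, which is
    an assumption about how the judged pool is defined.
    """
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# Hypothetical relevance labels for one query's ranked results.
score = ndcg_at_k([2, 0, 3, 1, 0, 1], k=5)
```

A ranking that already lists documents in descending order of relevance scores 1.0, and a ranking with no relevant documents scores 0.0, so a 25% relative improvement means the average of this per-query score rose by a quarter over the baseline's average.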