Mrabet Yassine, Kilicoglu Halil, Roberts Kirk, Demner-Fushman Dina
Lister Hill National Center for Biomedical Communications, U.S. National Library of Medicine, Bethesda, MD, USA.
University of Texas Health Science Center at Houston, Houston, TX, USA.
AMIA Annu Symp Proc. 2017 Feb 10;2016:914-923. eCollection 2016.
Determining the main topics in consumer health questions is a crucial step in processing them, because it narrows the search space to a specific semantic context. In this paper, we propose a topic recognition approach based on biomedical and open-domain knowledge bases. In the first step of our method, we recognize named entities in consumer health questions using an unsupervised method that relies on a biomedical knowledge base, UMLS, and an open-domain knowledge base, DBpedia. In the next step, we cast topic recognition as a binary classification problem: deciding whether or not a named entity is the question topic. We evaluated our approach on a dataset from the National Library of Medicine (NLM), introduced in this paper, and on another from the Genetic and Rare Diseases Information Center (GARD). Combining the knowledge bases outperformed each individual knowledge base by up to 16.5% F1 and achieved state-of-the-art performance. Our results demonstrate that combining open-domain knowledge bases with biomedical knowledge bases can lead to a substantial improvement in understanding user-generated health content.
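The two steps described in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the toy term sets stand in for UMLS and DBpedia, the three features are hypothetical, and the hand-set linear rule merely stands in for whatever classifier the paper actually trains. The abstract does not specify any of these details.

```python
from dataclasses import dataclass
from typing import List

# Toy stand-ins for the two knowledge bases (assumption: real UMLS and
# DBpedia lookups are far richer than a set of lowercase strings).
TOY_UMLS = {"diabetes", "insulin"}
TOY_DBPEDIA = {"insulin", "keto diet"}


@dataclass
class Candidate:
    text: str         # matched term
    start: int        # character offset of the match in the question
    in_umls: bool     # found in the biomedical knowledge base
    in_dbpedia: bool  # found in the open-domain knowledge base


def find_candidates(question: str) -> List[Candidate]:
    """Step 1 (sketch): unsupervised dictionary lookup in both toy KBs."""
    lowered = question.lower()
    found = []
    for term in sorted(TOY_UMLS | TOY_DBPEDIA):
        pos = lowered.find(term)
        if pos != -1:
            found.append(
                Candidate(term, pos, term in TOY_UMLS, term in TOY_DBPEDIA)
            )
    return found


def features(c: Candidate, question: str) -> List[float]:
    """Hypothetical features: KB membership and how early the entity appears."""
    earliness = 1.0 - c.start / max(len(question), 1)
    return [float(c.in_umls), float(c.in_dbpedia), earliness]


def is_topic(c: Candidate, question: str,
             weights: List[float], bias: float) -> bool:
    """Step 2 (sketch): binary decision, topic vs. not topic, via a linear rule."""
    score = sum(w * x for w, x in zip(weights, features(c, question))) + bias
    return score > 0


question = "Is insulin safe if I follow a keto diet?"
weights, bias = [1.0, 1.0, 1.0], -2.0  # hand-set for illustration, not learned
for cand in find_candidates(question):
    print(cand.text, is_topic(cand, question, weights, bias))
```

In this sketch, an entity found in both knowledge bases and mentioned early in the question scores higher than one found in only a single source, which is one plausible way the combination of knowledge bases could help. In practice the weights would be learned from annotated questions rather than set by hand.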