A Part-Of-Speech term weighting scheme for biomedical information retrieval.

Author information

Wang Yanshan, Wu Stephen, Li Dingcheng, Mehrabi Saeed, Liu Hongfang

Affiliations

Department of Health Sciences Research, Mayo Clinic, Rochester, MN, USA.

Department of Medical Informatics & Clinical Epidemiology, Oregon Health and Science University, Portland, OR, USA.

Publication information

J Biomed Inform. 2016 Oct;63:379-389. doi: 10.1016/j.jbi.2016.08.026. Epub 2016 Sep 1.

Abstract

In the era of digitalization, information retrieval (IR), which retrieves and ranks documents from large collections according to users' search queries, has been widely applied in the biomedical domain. Building patient cohorts from electronic health records (EHRs) and searching the literature for topics of interest are among its use cases. Meanwhile, natural language processing (NLP) techniques, such as tokenization and Part-Of-Speech (POS) tagging, have been developed for processing clinical documents and biomedical literature. We hypothesize that NLP can be incorporated into IR to strengthen conventional IR models. In this study, we propose two NLP-empowered IR models, POS-BoW and POS-MRF, which incorporate automatic POS-based term weighting schemes into bag-of-words (BoW) and Markov Random Field (MRF) IR models, respectively. In the proposed models, the POS-based term weights are calculated iteratively with a cyclic coordinate method, in which a golden section line search is applied along each coordinate to optimize an objective function defined by mean average precision (MAP). In the empirical experiments, we used the data sets from the Medical Records track of the Text REtrieval Conference (TREC) 2011 and 2012 and the Genomics track of TREC 2004. The evaluation on the TREC 2011 and 2012 Medical Records tracks shows that, for the POS-BoW models, the mean improvement rates for the IR evaluation metrics MAP, bpref, and P@10 are 10.88%, 4.54%, and 3.82%, respectively, compared with the BoW models; for the POS-MRF models, these rates are 13.59%, 8.20%, and 8.78%, compared with the MRF models. Additionally, we experimentally verify that the proposed weighting approach is superior to simple heuristic and frequency-based weighting approaches, and we validate our POS category selection. Using the optimal weights calculated in this experiment, we tested the proposed models on the TREC 2004 Genomics track and obtained average improvement rates of 8.63% and 10.04% for POS-BoW and POS-MRF, respectively. These significant improvements verify the effectiveness of leveraging POS tagging for biomedical IR tasks.
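
The optimization described in the abstract (a cyclic coordinate method with a golden section line search along each coordinate) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the [0, 1] weight bounds, the stopping rule, and the toy objective are all assumptions made for illustration. In a real setting the objective would run retrieval with the candidate POS weights and return MAP; MAP is not guaranteed to be unimodal along a coordinate, so the line search is only a heuristic there.

```python
import math
from typing import Callable, List, Sequence

INV_PHI = (math.sqrt(5) - 1) / 2  # 1 / golden ratio, about 0.618


def golden_section_max(f: Callable[[float], float],
                       lo: float, hi: float, tol: float = 1e-4) -> float:
    """Approximate the maximizer of a unimodal 1-D function on [lo, hi]."""
    a, b = lo, hi
    c = b - INV_PHI * (b - a)
    d = a + INV_PHI * (b - a)
    fc, fd = f(c), f(d)
    while b - a > tol:
        if fc > fd:
            # The maximum lies in [a, d]; the old c becomes the new d.
            b, d, fd = d, c, fc
            c = b - INV_PHI * (b - a)
            fc = f(c)
        else:
            # The maximum lies in [c, b]; the old d becomes the new c.
            a, c, fc = c, d, fd
            d = a + INV_PHI * (b - a)
            fd = f(d)
    return (a + b) / 2


def cyclic_coordinate_max(objective: Callable[[Sequence[float]], float],
                          start: Sequence[float],
                          lo: float = 0.0, hi: float = 1.0,
                          max_sweeps: int = 20, eps: float = 1e-6) -> List[float]:
    """Optimize one weight at a time, cycling until a sweep stops improving."""
    weights = list(start)
    best = objective(weights)
    for _ in range(max_sweeps):
        before = best
        for i in range(len(weights)):
            def along_axis(x: float, i: int = i) -> float:
                # Vary only coordinate i; keep the other weights fixed.
                return objective(weights[:i] + [x] + weights[i + 1:])

            x = golden_section_max(along_axis, lo, hi)
            value = along_axis(x)
            if value > best:  # accept only improving moves
                weights[i], best = x, value
        if best - before < eps:
            break
    return weights


def toy_objective(w: Sequence[float]) -> float:
    """Hypothetical stand-in for MAP: a concave function peaking at (0.3, 0.7)."""
    return -(w[0] - 0.3) ** 2 - (w[1] - 0.7) ** 2


if __name__ == "__main__":
    print(cyclic_coordinate_max(toy_objective, [0.5, 0.5]))
```

Each coordinate here would correspond to the weight of one POS category. Accepting a move only when it improves the objective keeps the score from decreasing between sweeps, which matters when the true objective is a step-like metric such as MAP.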
