Sarrouti Mourad, Ouatik El Alaoui Said
Laboratory of Computer Science and Modeling, FSDM, Sidi Mohammed Ben Abdellah University, Fez, Morocco.
J Biomed Inform. 2017 Apr;68:96-103. doi: 10.1016/j.jbi.2017.03.001. Epub 2017 Mar 7.
Passage retrieval, the identification of top-ranked passages that may contain the answer to a given biomedical question, is a crucial component of any biomedical question answering (QA) system. Passage retrieval in open-domain QA is a longstanding challenge that has been widely studied over recent decades, but it still requires further work in biomedical QA. In this paper, we present a new biomedical passage retrieval method based on Stanford CoreNLP sentence/passage length, a probabilistic information retrieval (IR) model, and UMLS concepts.
In the proposed method, we first use our document retrieval system, based on the PubMed search engine and UMLS similarity, to retrieve documents relevant to a given biomedical question. We then take the abstracts of the retrieved documents and use the Stanford CoreNLP sentence splitter to segment them into a set of sentences, i.e., candidate passages. Finally, using stemmed words and UMLS concepts as features for the BM25 model, we compute the similarity score between the biomedical question and each candidate passage and keep the N top-ranked ones.
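The final ranking step can be illustrated with a minimal sketch of BM25 scoring over candidate sentences. This is not the authors' implementation: the `tokenize` helper is a hypothetical stand-in that only lowercases and splits words, whereas the method described above uses stemmed words and UMLS concepts as features. The function names and the parameter values `k1=1.2` and `b=0.75` are likewise assumptions chosen for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Hypothetical stand-in for the paper's features (stemmed words plus
    # UMLS concepts): here we only lowercase and split into word tokens.
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_rank(question, passages, top_n=3, k1=1.2, b=0.75):
    """Return the top_n (passage, score) pairs ranked by BM25 score."""
    docs = [tokenize(p) for p in passages]
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs if n_docs else 0.0

    # Document frequency: number of passages containing each term.
    df = Counter()
    for d in docs:
        df.update(set(d))

    query_terms = set(tokenize(question))
    scored = []
    for idx, d in enumerate(docs):
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            # Smoothed IDF that stays positive for very common terms.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturated by k1 and normalized by passage length.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scored.append((score, idx))

    # Highest score first; ties keep the original passage order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(passages[i], s) for s, i in scored[:top_n]]


if __name__ == "__main__":
    candidates = [
        "Aspirin inhibits platelet aggregation.",
        "The weather was sunny.",
        "Platelet counts were measured after aspirin treatment in patients.",
    ]
    for passage, score in bm25_rank("Does aspirin affect platelet aggregation?", candidates):
        print(f"{score:.3f}  {passage}")
```

In a fuller pipeline, the candidate list would come from sentence-splitting the retrieved abstracts, and `tokenize` would be replaced by a feature extractor that emits stems and concept identifiers.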
Experimental evaluations performed on large standard datasets provided by the BioASQ challenge show that the proposed method compares favorably with current state-of-the-art methods, significantly outperforming them by an average of 6.84% in terms of mean average precision (MAP).
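For readers unfamiliar with the metric, the following is a minimal sketch of the textbook definition of MAP: the mean, over questions, of the average precision of each ranked list. The function names and the input layout are assumptions made for illustration, and the official BioASQ evaluation may normalize average precision differently.

```python
def average_precision(ranked_ids, relevant_ids):
    """Average precision of one ranked list (textbook definition).

    Assumes ranked_ids contains no duplicate identifiers.
    """
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, pid in enumerate(ranked_ids, start=1):
        if pid in relevant:
            hits += 1
            # Precision measured at the rank of each relevant item.
            precision_sum += hits / rank
    return precision_sum / len(relevant)


def mean_average_precision(runs):
    """Mean of average precision over (ranked_ids, relevant_ids) pairs."""
    if not runs:
        return 0.0
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


if __name__ == "__main__":
    runs = [
        (["a", "b", "c"], ["a", "c"]),  # relevant items at ranks 1 and 3
        (["x", "y"], ["y"]),            # relevant item at rank 2
    ]
    print(round(mean_average_precision(runs), 4))
```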
We have proposed an efficient passage retrieval method that can be used in biomedical QA systems to retrieve relevant passages with high mean average precision.