Department of Electrical, Computer, and Biomedical Engineering, Toronto Metropolitan University, Toronto, Canada.
J Biomed Inform. 2023 Jun;142:104386. doi: 10.1016/j.jbi.2023.104386. Epub 2023 May 12.
With the onset of the Coronavirus Disease 2019 (COVID-19) pandemic, the number of publicly available biomedical information sources has surged, making it increasingly challenging to retrieve text relevant to a topic of interest. In this paper, we propose a Contextual Query Expansion framework based on clinical Domain knowledge (CQED) for formulating an effective search over PubMed that retrieves COVID-19 scholarly articles relevant to a given information need.
For training and evaluation, we use the widely adopted TREC-COVID benchmark. Given a query, the proposed framework uses a contextual and a domain-specific neural language model to generate a set of candidate query expansion terms that enrich the original query. The framework also includes a multi-head attention mechanism that is trained alongside a learning-to-rank model to re-rank the generated candidate expansion terms. The original query and the top-ranked expansion terms are then submitted to the PubMed search engine to retrieve scholarly articles relevant to the information need. The CQED framework has four variants, depending on the learning path adopted for training and re-ranking the candidate expansion terms.
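The final step described above (keeping the top-ranked expansion terms and combining them with the original query) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the example scores, and the choice to join terms with `OR` are all assumptions, and the scores stand in for whatever the trained re-ranker would produce.

```python
from typing import List, Tuple


def top_k_terms(scored_terms: List[Tuple[str, float]], k: int) -> List[str]:
    """Return the k candidate terms with the highest re-ranker scores."""
    ranked = sorted(scored_terms, key=lambda pair: pair[1], reverse=True)
    return [term for term, _ in ranked[:k]]


def build_expanded_query(original: str,
                         scored_terms: List[Tuple[str, float]],
                         k: int = 3) -> str:
    """Combine the original query with the top-k expansion terms.

    Assumption: terms are quoted and OR-ed with the original query.
    The abstract does not say how the terms are actually combined.
    """
    terms = top_k_terms(scored_terms, k)
    if not terms:
        return f"({original})"
    expansion = " OR ".join(f'"{term}"' for term in terms)
    return f"({original}) OR ({expansion})"


# Hypothetical candidates and scores, for illustration only.
candidates = [
    ("sars-cov-2", 0.91),
    ("weather", 0.12),
    ("viral transmission", 0.77),
]
expanded = build_expanded_query("coronavirus origin", candidates, k=2)
print(expanded)
```

The resulting string would then be sent to the search engine in place of the original query. Changing `k` trades recall against query drift, which is why the ranking step matters.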
Compared with the original query, the model substantially improves search performance: Recall@1000 improves by 190.85% and NDCG@1000 by 343.55%. The model also outperforms all existing state-of-the-art baselines. In terms of P@10, the variant optimized for precision outperforms all baselines (0.7987). In terms of NDCG@10 (0.7986), MAP (0.3450), and bpref (0.4900), the CQED variant optimized on the average of all retrieval measures outperforms all baselines.
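For readers unfamiliar with the measures reported above, the following sketch shows one common way to compute P@k, NDCG@k, and a relative improvement percentage. It is an illustration under stated assumptions, not the evaluation code used in the paper: the function names are hypothetical, NDCG uses the linear-gain, log2-discount formulation, and the inputs in the example are toy values rather than results from the study.

```python
import math
from typing import List


def precision_at_k(relevance: List[int], k: int) -> float:
    """Fraction of the top-k results that are relevant (binary labels)."""
    return sum(relevance[:k]) / k


def dcg_at_k(gains: List[float], k: int) -> float:
    """Discounted cumulative gain with a log2 rank discount."""
    return sum(g / math.log2(rank + 1)
               for rank, g in enumerate(gains[:k], start=1))


def ndcg_at_k(ranked_gains: List[float],
              judged_gains: List[float],
              k: int) -> float:
    """DCG of the ranking divided by the DCG of an ideal ordering.

    judged_gains holds the gains of all judged documents for the query.
    """
    ideal = dcg_at_k(sorted(judged_gains, reverse=True), k)
    return dcg_at_k(ranked_gains, k) / ideal if ideal > 0 else 0.0


def relative_improvement(baseline: float, improved: float) -> float:
    """Percentage change of a measure relative to its baseline value."""
    return (improved - baseline) / baseline * 100.0


# Toy inputs, not values from the paper.
print(precision_at_k([1, 0, 1, 1], k=2))
print(ndcg_at_k([0, 1, 2], judged_gains=[2, 1, 0], k=3))
print(relative_improvement(baseline=0.2, improved=0.5))
```

Under this definition of relative improvement, a figure above 100% means the measure more than doubled relative to the unexpanded query.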
The proposed model successfully expands queries posed to PubMed and improves search performance relative to all existing baselines. A success/failure analysis shows that the model improved the search performance of every evaluated query. Moreover, an ablation study showed that overall performance decreases when the generated candidate terms are not ranked. In future work, we plan to explore applying the presented query expansion framework to technology-assisted Systematic Literature Reviews (SLRs).