COVID-19 information retrieval with deep-learning based semantic search, question answering, and abstractive summarization.

Author information

Esteva Andre, Kale Anuprit, Paulus Romain, Hashimoto Kazuma, Yin Wenpeng, Radev Dragomir, Socher Richard

Affiliations

Salesforce Research, Palo Alto, CA, USA.

Yale University, New Haven, CT, USA.

Publication information

NPJ Digit Med. 2021 Apr 12;4(1):68. doi: 10.1038/s41746-021-00437-0.

DOI:10.1038/s41746-021-00437-0

PMID:33846532

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC8041998/

Abstract

The COVID-19 global pandemic has resulted in international efforts to understand, track, and mitigate the disease, yielding a significant corpus of COVID-19 and SARS-CoV-2-related publications across scientific disciplines. Throughout 2020, over 400,000 coronavirus-related publications have been collected through the COVID-19 Open Research Dataset. Here, we present CO-Search, a semantic, multi-stage, search engine designed to handle complex queries over the COVID-19 literature, potentially aiding overburdened health workers in finding scientific answers and avoiding misinformation during a time of crisis. CO-Search is built from two sequential parts: a hybrid semantic-keyword retriever, which takes an input query and returns a sorted list of the 1000 most relevant documents, and a re-ranker, which further orders them by relevance. The retriever is composed of a deep learning model (Siamese-BERT) that encodes query-level meaning, along with two keyword-based models (BM25, TF-IDF) that emphasize the most important words of a query. The re-ranker assigns a relevance score to each document, computed from the outputs of (1) a question-answering module which gauges how much each document answers the query, and (2) an abstractive summarization module which determines how well a query matches a generated summary of the document. To account for the relatively limited dataset, we develop a text augmentation technique which splits the documents into pairs of paragraphs and the citations contained in them, creating millions of (citation title, paragraph) tuples for training the retriever. We evaluate our system ( http://einstein.ai/covid ) on the data of the TREC-COVID information retrieval challenge, obtaining strong performance across multiple key information retrieval metrics.
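The hybrid retrieval stage described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how the Siamese-BERT, BM25, and TF-IDF outputs are combined, so reciprocal rank fusion is used here purely as an assumption. The function names (`bm25_scores`, `tfidf_scores`, `rrf`, `retrieve`) and the optional `semantic_scorer` callable, which stands in for a neural encoder, are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    # Whitespace tokenizer; a real system would use a proper tokenizer.
    return text.lower().split()


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Okapi BM25 score of each document for the query."""
    toks = [tokenize(d) for d in docs]
    n = len(toks)
    avg_len = sum(len(t) for t in toks) / n
    df = Counter(term for t in toks for term in set(t))
    scores = []
    for t in toks:
        tf = Counter(t)
        s = 0.0
        for term in set(tokenize(query)):
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(t) / avg_len)
            s += idf * tf[term] * (k1 + 1) / norm
        scores.append(s)
    return scores


def tfidf_scores(query, docs):
    """Cosine similarity between TF-IDF vectors of the query and each document."""
    toks = [tokenize(d) for d in docs]
    n = len(toks)
    df = Counter(term for t in toks for term in set(t))
    idf = {term: math.log((1 + n) / (1 + c)) + 1 for term, c in df.items()}

    def vec(tokens):
        return {term: c * idf.get(term, 0.0) for term, c in Counter(tokens).items()}

    def cosine(u, v):
        dot = sum(w * v.get(term, 0.0) for term, w in u.items())
        nu = math.sqrt(sum(w * w for w in u.values()))
        nv = math.sqrt(sum(w * w for w in v.values()))
        return dot / (nu * nv) if nu and nv else 0.0

    q = vec(tokenize(query))
    return [cosine(q, vec(t)) for t in toks]


def rrf(score_lists, k=60):
    """Reciprocal rank fusion (an assumed combination rule, not from the paper)."""
    fused = [0.0] * len(score_lists[0])
    for scores in score_lists:
        order = sorted(range(len(scores)), key=lambda i: -scores[i])
        for rank, i in enumerate(order, start=1):
            fused[i] += 1.0 / (k + rank)
    return fused


def retrieve(query, docs, top_k=1000, semantic_scorer=None):
    """Return indices of the top_k documents, best first."""
    lists = [bm25_scores(query, docs), tfidf_scores(query, docs)]
    if semantic_scorer is not None:
        # Hypothetical hook for a neural (query, document) similarity function.
        lists.append([semantic_scorer(query, d) for d in docs])
    fused = rrf(lists)
    return sorted(range(len(docs)), key=lambda i: -fused[i])[:top_k]


docs = [
    "masks reduce transmission of the virus",
    "bridge maintenance schedules",
    "virus transmission in hospitals and masks",
]
ranking = retrieve("virus transmission masks", docs)
```

In this toy corpus the two virus-related documents rank ahead of the unrelated one. A second-stage re-ranker, as described in the abstract, would then reorder the returned list using question-answering and summarization signals; that stage is not sketched here.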


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/3604/8041998/56fe10478edb/41746_2021_437_Fig1_HTML.jpg

Similar articles

COVID-19 information retrieval with deep-learning based semantic search, question answering, and abstractive summarization.

NPJ Digit Med. 2021 Apr 12;4(1):68. doi: 10.1038/s41746-021-00437-0.

Searching COVID-19 Clinical Research Using Graph Queries: Algorithm Development and Validation.

J Med Internet Res. 2024 May 30;26:e52655. doi: 10.2196/52655.

A COVID-19 Search Engine (CO-SE) with Transformer-based architecture.

Healthc Anal (N Y). 2022 Nov;2:100068. doi: 10.1016/j.health.2022.100068. Epub 2022 Jun 6.

How Does ChatGPT Use Source Information Compared With Google? A Text Network Analysis of Online Health Information.

Clin Orthop Relat Res. 2024 Apr 1;482(4):578-588. doi: 10.1097/CORR.0000000000002995. Epub 2024 Mar 1.

Revealing Opinions for COVID-19 Questions Using a Context Retriever, Opinion Aggregator, and Question-Answering Model: Model Development Study.

J Med Internet Res. 2021 Mar 19;23(3):e22860. doi: 10.2196/22860.

COBERT: COVID-19 Question Answering System Using BERT.

Arab J Sci Eng. 2021 Jun 23:1-11. doi: 10.1007/s13369-021-05810-5.

SemBioNLQA: A semantic biomedical question answering system for retrieving exact and ideal answers to natural language questions.

Artif Intell Med. 2020 Jan;102:101767. doi: 10.1016/j.artmed.2019.101767. Epub 2019 Nov 28.

Literature Retrieval for Precision Medicine with Neural Matching and Faceted Summarization.

Proc Conf Empir Methods Nat Lang Process. 2020 Nov;2020:3389-3399. doi: 10.18653/v1/2020.findings-emnlp.304.

Information Retrieval in an Infodemic: The Case of COVID-19 Publications.

J Med Internet Res. 2021 Sep 17;23(9):e30161. doi: 10.2196/30161.

Learning to rank query expansion terms for COVID-19 scholarly search.

J Biomed Inform. 2023 Jun;142:104386. doi: 10.1016/j.jbi.2023.104386. Epub 2023 May 12.

Cited by

AI edge cloud service provisioning for knowledge management smart applications.

Sci Rep. 2025 Sep 1;15(1):32246. doi: 10.1038/s41598-025-14429-7.

Enhancing biomedical named entity recognition with parallel boundary detection and category classification.

BMC Bioinformatics. 2025 Feb 25;26(1):63. doi: 10.1186/s12859-025-06086-4.

Targeting COVID-19 and Human Resources for Health News Information Extraction: Algorithm Development and Validation.

JMIR AI. 2024 Oct 30;3:e55059. doi: 10.2196/55059.

A COVID-19 Search Engine (CO-SE) with Transformer-based architecture.

Healthc Anal (N Y). 2022 Nov;2:100068. doi: 10.1016/j.health.2022.100068. Epub 2022 Jun 6.

Semantic matching based legal information retrieval system for COVID-19 pandemic.

Artif Intell Law (Dordr). 2023 Mar 14:1-30. doi: 10.1007/s10506-023-09354-x.

Leveraging physiology and artificial intelligence to deliver advancements in health care.

Physiol Rev. 2023 Oct 1;103(4):2423-2450. doi: 10.1152/physrev.00033.2022. Epub 2023 Apr 27.

The reproducibility of COVID-19 data analysis: paradoxes, pitfalls, and future challenges.

PNAS Nexus. 2022 Aug 23;1(3):pgac125. doi: 10.1093/pnasnexus/pgac125. eCollection 2022 Jul.

Complex Knowledge Base Question Answering for Intelligent Bridge Management Based on Multi-Task Learning and Cross-Task Constraints.

Entropy (Basel). 2022 Dec 10;24(12):1805. doi: 10.3390/e24121805.

LitCovid ensemble learning for COVID-19 multi-label classification.

Database (Oxford). 2022 Nov 25;2022. doi: 10.1093/database/baac103.

Exploration of biomedical knowledge for recurrent glioblastoma using natural language processing deep learning models.

BMC Med Inform Decis Mak. 2022 Oct 13;22(1):267. doi: 10.1186/s12911-022-02003-4.

