Esteva Andre, Kale Anuprit, Paulus Romain, Hashimoto Kazuma, Yin Wenpeng, Radev Dragomir, Socher Richard
Salesforce Research, Palo Alto, CA, USA.
Yale University, New Haven, CT, USA.
NPJ Digit Med. 2021 Apr 12;4(1):68. doi: 10.1038/s41746-021-00437-0.
The COVID-19 global pandemic has resulted in international efforts to understand, track, and mitigate the disease, yielding a large corpus of COVID-19 and SARS-CoV-2-related publications across scientific disciplines. Throughout 2020, over 400,000 coronavirus-related publications were collected through the COVID-19 Open Research Dataset. Here, we present CO-Search, a semantic, multi-stage search engine designed to handle complex queries over the COVID-19 literature, potentially aiding overburdened health workers in finding scientific answers and avoiding misinformation during a time of crisis. CO-Search is built from two sequential parts: a hybrid semantic-keyword retriever, which takes an input query and returns a sorted list of the 1000 most relevant documents, and a re-ranker, which further orders them by relevance. The retriever is composed of a deep learning model (Siamese-BERT) that encodes query-level meaning, along with two keyword-based models (BM25, TF-IDF) that emphasize the most important words of a query. The re-ranker assigns a relevance score to each document, computed from the outputs of (1) a question-answering module, which gauges how well each document answers the query, and (2) an abstractive summarization module, which determines how well the query matches a generated summary of the document. To account for the relatively limited dataset, we develop a text augmentation technique that splits the documents into pairs of paragraphs and the citations contained in them, creating millions of (citation title, paragraph) tuples for training the retriever. We evaluate our system (http://einstein.ai/covid) on the data of the TREC-COVID information retrieval challenge, obtaining strong performance across multiple key information retrieval metrics.
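The keyword half of the hybrid retriever described above can be illustrated with a minimal sketch. The abstract does not say how the Siamese-BERT, BM25, and TF-IDF signals are merged, so the reciprocal rank fusion used here is an assumption, not the authors' stated method. The semantic encoder is not implemented; a ranking from such a model could be passed in through the hypothetical `extra_rankings` parameter. All function names, the whitespace tokenizer, and the constants `k1`, `b`, and `k` are illustrative choices.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: naive lowercase whitespace tokenizer, for illustration only.
    return text.lower().split()


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Okapi BM25 score of each document for the query."""
    toks = [tokenize(d) for d in docs]
    n = len(toks)
    avg_len = sum(len(t) for t in toks) / n
    # Document frequency: number of documents containing each term.
    df = Counter(term for t in toks for term in set(t))
    scores = []
    for t in toks:
        tf = Counter(t)
        s = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(t) / avg_len)
            s += idf * tf[term] * (k1 + 1) / norm
        scores.append(s)
    return scores


def tfidf_scores(query, docs):
    """Cosine similarity between TF-IDF vectors of the query and each document."""
    toks = [tokenize(d) for d in docs]
    n = len(toks)
    df = Counter(term for t in toks for term in set(t))
    idf = {term: math.log((1 + n) / (1 + c)) + 1 for term, c in df.items()}

    def vec(tokens):
        # Terms unseen in the corpus are dropped from the vector.
        return {term: cnt * idf[term]
                for term, cnt in Counter(tokens).items() if term in idf}

    def cosine(u, v):
        dot = sum(w * v.get(term, 0.0) for term, w in u.items())
        nu = math.sqrt(sum(w * w for w in u.values()))
        nv = math.sqrt(sum(w * w for w in v.values()))
        return dot / (nu * nv) if nu and nv else 0.0

    q = vec(tokenize(query))
    return [cosine(q, vec(t)) for t in toks]


def ranking(scores):
    """Document indices sorted from highest to lowest score."""
    return sorted(range(len(scores)), key=lambda i: -scores[i])


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse rankings: each document earns 1 / (k + rank) from every ranking."""
    fused = Counter()
    for r in rankings:
        for rank, doc_id in enumerate(r, start=1):
            fused[doc_id] += 1.0 / (k + rank)
    return sorted(fused, key=lambda d: -fused[d])


def retrieve(query, docs, extra_rankings=(), top_k=1000):
    """Return up to top_k document indices, best first.

    extra_rankings is a hypothetical hook for a ranking produced by a
    semantic encoder, which this sketch does not implement.
    """
    rankings = [ranking(bm25_scores(query, docs)),
                ranking(tfidf_scores(query, docs))]
    rankings.extend(extra_rankings)
    return reciprocal_rank_fusion(rankings)[:top_k]
```

Calling `retrieve("coronavirus transmission masks", docs)` on a list of document strings returns document indices ordered by fused rank. A re-ranking stage like the one in the abstract would then rescore only this shortlist, so the more expensive question-answering and summarization models never run over the full corpus.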