Department of Biostatistics, St. Jude Children's Research Hospital, Memphis, TN, United States.
Department of Linguistics, The Ohio State University, Columbus, OH, United States.
J Med Internet Res. 2021 Mar 19;23(3):e22860. doi: 10.2196/22860.
COVID-19 has challenged global public health because it is highly contagious and can be lethal. Many studies of the disease have been published or are under way, but the research remains largely ongoing and inconclusive.
A potential way to accelerate COVID-19 research is to use existing information gleaned from research into other viruses that belong to the coronavirus family. Our objective is to develop a natural language processing method for answering factoid questions related to COVID-19 using published articles as knowledge sources.
Given a question, a BM25-based context retriever first selects the most relevant passages from previously published articles. Second, a pretrained bidirectional encoder representations from transformers (BERT) question-answering model extracts an answer from each selected context passage. Third, an opinion aggregator, which combines a biterm topic model with k-means clustering, groups all answers into several opinions.
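The retrieval step described above can be illustrated with a minimal sketch of Okapi BM25 scoring. This is not the authors' implementation: the function names, the whitespace tokenizer, the parameter defaults (`k1=1.5`, `b=0.75`), and the sample passages are all assumptions made for illustration.

```python
import math
from collections import Counter


def bm25_scores(query, passages, k1=1.5, b=0.75):
    """Score each passage against the query with Okapi BM25.

    Tokenization is plain lowercase whitespace splitting (an assumption);
    a real system would use a proper tokenizer and stop-word handling.
    """
    docs = [p.lower().split() for p in passages]
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs

    # Document frequency: number of passages containing each term.
    df = Counter()
    for d in docs:
        df.update(set(d))

    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            # Smoothed inverse document frequency (always non-negative).
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term-frequency saturation with document-length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def top_k_passages(query, passages, k=2):
    """Return the k passages with the highest BM25 score."""
    scores = bm25_scores(query, passages)
    ranked = sorted(range(len(passages)), key=lambda i: scores[i], reverse=True)
    return [passages[i] for i in ranked[:k]]


# Hypothetical passages, for illustration only.
passages = [
    "coronavirus incubation period is about five days",
    "bats are a natural reservoir of coronavirus",
    "the weather was sunny today",
]
best = top_k_passages("incubation period", passages, k=1)
```

In a pipeline of the kind described, each passage returned by `top_k_passages` would then be handed to the question-answering model to extract a candidate answer.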
We applied the proposed pipeline to extract answers, opinions, and the most frequent words related to six questions from the COVID-19 Open Research Dataset Challenge. By showing the longitudinal distributions of the opinions, we uncovered trends in opinions and popular words across articles published in the five time periods assessed: before 1990, 1990-1999, 2000-2009, 2010-2018, and since 2019. The changes in opinions and popular words agree with several distinct characteristics and challenges of COVID-19, including a higher risk for older adults and people with pre-existing medical conditions; high contagion and rapid transmission; and a more urgent need for screening and testing. The opinions and popular words also provide additional insights into the COVID-19-related questions.
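As a rough illustration of how a longitudinal distribution of opinions over the five periods could be tabulated, the sketch below bins publication years and computes the share of each opinion per period. The record format `(year, opinion_label)`, the function names, and the sample data are hypothetical; in practice the opinion labels would come from the opinion aggregator.

```python
from collections import Counter, defaultdict

# The five publication periods named in the text.
PERIODS = ["before 1990", "1990-1999", "2000-2009", "2010-2018", "since 2019"]


def period_of(year):
    """Map a publication year to one of the five periods."""
    if year < 1990:
        return "before 1990"
    if year <= 1999:
        return "1990-1999"
    if year <= 2009:
        return "2000-2009"
    if year <= 2018:
        return "2010-2018"
    return "since 2019"


def opinion_distribution(records):
    """Return {period: {opinion: share}} for periods that have any answers.

    `records` is an iterable of (publication_year, opinion_label) pairs.
    """
    counts = defaultdict(Counter)
    for year, opinion in records:
        counts[period_of(year)][opinion] += 1

    dist = {}
    for period in PERIODS:
        total = sum(counts[period].values())
        if total:
            dist[period] = {op: n / total for op, n in counts[period].items()}
    return dist


# Hypothetical records, for illustration only.
records = [(1985, "A"), (2003, "A"), (2004, "B"), (2020, "B")]
dist = opinion_distribution(records)
```

Comparing the per-period shares returned by `opinion_distribution` is one simple way to see how the prevalence of an opinion shifts over time.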
Compared with other methods of literature retrieval and answer generation, opinion aggregation using our method leads to more interpretable, robust, and comprehensive question-specific literature reviews. The results demonstrate the usefulness of the proposed method in answering COVID-19-related questions with main opinions and capturing the trends of research about COVID-19 and other relevant strains of coronavirus in recent years.