The Allen Institute for Artificial Intelligence, Seattle, WA 98112, USA.
Brief Bioinform. 2021 Mar 22;22(2):781-799. doi: 10.1093/bib/bbaa296.
More than 50 000 papers have been published about COVID-19 since the beginning of 2020, and several hundred new papers continue to be published every day. This rate of scientific output leads to information overload, making it difficult for researchers, clinicians and public health officials to keep up with the latest findings. Automated text mining techniques for searching, reading and summarizing papers can help address this overload. In this review, we describe the many resources introduced to support text mining applications over the COVID-19 literature, specifically the corpora, modeling resources, systems and shared tasks. We compile a list of 39 systems that provide functionality such as search, discovery, visualization and summarization over the COVID-19 literature. For each system, we provide a qualitative description and assessment of its performance, unique data or user interface features, and modeling decisions. Many systems focus on search and discovery, though several provide novel features, such as the ability to summarize findings across multiple documents or to link scientific articles with clinical trials. We also describe the public corpora, models and shared tasks that have been introduced to help reduce repeated effort among community members; some of these resources (especially shared tasks) can provide a basis for comparing the performance of different systems. Finally, we summarize promising results and open challenges for text mining the COVID-19 literature.