Weissenbacher Davy, O'Connor Karen, Klein Ari, Golder Su, Flores Ivan, Elyaderani Amir, Scotch Matthew, Gonzalez-Hernandez Graciela
Cedars-Sinai Medical Center, Los Angeles, CA, USA.
University of Pennsylvania, Philadelphia, PA, USA.
medRxiv. 2023 Aug 4:2023.07.29.23293370. doi: 10.1101/2023.07.29.23293370.
Many studies require researchers to extract specific information from the published literature, such as details about sequence records or randomized controlled trials. While manual extraction is cost-efficient for small studies, larger efforts such as systematic reviews are far more costly and time-consuming. To avoid exhaustive manual searches and extraction, and their related cost and effort, natural language processing (NLP) methods can be tailored for the more subtle extraction and decision tasks that typically only humans have performed. The need for studies that use the published literature as a data source became even more evident as the COVID-19 pandemic raged through the world and millions of sequenced samples were deposited in public repositories such as GISAID and GenBank. These deposits promised large genomic epidemiology studies, but more often than not the records lacked important details, preventing large-scale analyses: granular geographic location and the most basic patient-relevant data, such as demographic information or clinical outcomes, were frequently not noted in the sequence record. However, some of these data were indeed published, but in the text, tables, or supplementary material of a corresponding article. We present here methods to identify journal articles that report having produced and deposited new SARS-CoV-2 sequences in GenBank or GISAID, as the articles that originally produced and made available the sequences are the most likely to include high-level details about the patients from whom the sequences were obtained. Human annotators validated the approach, creating a gold standard set for training and validation of a machine learning classifier.
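The abstract does not specify the classifier architecture or features used in the study; as an illustration only, the triage task it describes (flagging articles that report depositing new sequences) can be sketched as a simple bag-of-words classifier. The training snippets and the Naive Bayes model below are hypothetical assumptions, not the authors' method or corpus.

```python
import math
from collections import Counter

def tokenize(text):
    """Lowercase whitespace tokenization with basic punctuation stripping."""
    return [w.strip(".,;:()").lower() for w in text.split()]

class NaiveBayes:
    """Minimal multinomial Naive Bayes for binary article triage (toy sketch)."""
    def __init__(self):
        self.counts = {0: Counter(), 1: Counter()}  # per-class token counts
        self.docs = Counter()                       # per-class document counts
        self.vocab = set()

    def fit(self, texts, labels):
        for text, y in zip(texts, labels):
            self.docs[y] += 1
            for tok in tokenize(text):
                self.counts[y][tok] += 1
                self.vocab.add(tok)

    def predict(self, text):
        scores = {}
        for y in (0, 1):
            # Log prior plus Laplace-smoothed log likelihoods
            score = math.log(self.docs[y] / sum(self.docs.values()))
            total = sum(self.counts[y].values()) + len(self.vocab)
            for tok in tokenize(text):
                score += math.log((self.counts[y][tok] + 1) / total)
            scores[y] = score
        return max(scores, key=scores.get)

# Hypothetical training snippets (label 1 = reports depositing new sequences)
train_texts = [
    "we sequenced 120 samples and deposited the genomes in GISAID",
    "novel SARS-CoV-2 sequences were submitted to GenBank under accession numbers",
    "we review the literature on vaccine hesitancy during the pandemic",
    "a survey of hospital staffing shortages in 2020",
]
train_labels = [1, 1, 0, 0]

clf = NaiveBayes()
clf.fit(train_texts, train_labels)
print(clf.predict("complete genomes were deposited in GenBank"))
```

In practice, the gold standard set produced by the human annotators would supply the labels, and a modern classifier (e.g., a fine-tuned transformer) would replace this toy model.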
Identifying these articles is a crucial step toward future automated informatics pipelines that will apply machine learning and NLP to identify patient characteristics such as comorbidities, outcomes, age, gender, and race, enriching SARS-CoV-2 sequence databases with actionable information for defining large genomic epidemiology studies. Enriched patient metadata can, in turn, enable secondary data analysis at scale to uncover associations between the viral genome (including variants of concern and their sublineages), transmission risk, and health outcomes. For such enrichment to happen, however, the right papers must be found and very detailed data must be extracted from them. Finding these very specific articles also facilitates scoping and systematic reviews, greatly reducing the time needed for full-text analysis and extraction.