Magge Arjun, Weissenbacher Davy, Sarker Abeed, Scotch Matthew, Gonzalez-Hernandez Graciela
College of Health Solutions, Arizona State University, Tempe, AZ 85281, USA; Biodesign Center for Environmental Health Engineering, Arizona State University, Tempe, AZ 85281, USA.
Pac Symp Biocomput. 2019;24:100-111.
Phylogeography research involving virus spread and tree reconstruction relies on accurate geographic locations of infected hosts. The insufficient level of geographic information in nucleotide sequence repositories such as GenBank motivates the use of natural language processing methods to extract geographic location names (toponyms) from the scientific article associated with a sequence and to disambiguate those locations to their coordinates. In this paper, we present an extensive study of multiple recurrent neural network architectures for the task of extracting geographic locations, and of their effective contribution to the disambiguation task using population heuristics. The methods presented in this paper achieve a strict detection F1 score of 0.94, a disambiguation accuracy of 91%, and an overall resolution F1 score of 0.88, all significantly higher than those of previously developed methods, improving our capability to find the location of infected hosts and enrich metadata information.
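To illustrate the population heuristic mentioned in the abstract, the following is a minimal sketch, not the authors' implementation. It assumes a gazetteer that maps a lowercased toponym to candidate locations, and it resolves an ambiguous name to the most populous candidate. The `Candidate` type, the `GAZETTEER` dictionary, the `disambiguate` function, and all coordinates and population figures are hypothetical and for illustration only.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Candidate:
    """One gazetteer entry for a toponym (hypothetical structure)."""
    name: str
    country: str
    latitude: float
    longitude: float
    population: int


# Toy gazetteer: values are illustrative placeholders, not real gazetteer data.
GAZETTEER: Dict[str, List[Candidate]] = {
    "paris": [
        Candidate("Paris", "FR", 48.86, 2.35, 2_000_000),
        Candidate("Paris", "US", 33.66, -95.56, 25_000),
    ],
}


def disambiguate(
    toponym: str,
    gazetteer: Dict[str, List[Candidate]] = GAZETTEER,
) -> Optional[Candidate]:
    """Return the most populous candidate for a toponym, or None if unknown."""
    candidates = gazetteer.get(toponym.strip().lower(), [])
    if not candidates:
        return None
    # Population heuristic: prefer the candidate with the largest population.
    return max(candidates, key=lambda c: c.population)
```

Under these assumptions, `disambiguate("Paris")` selects the entry with the larger population, and a name absent from the gazetteer yields `None`. A heuristic of this kind depends on the detection step supplying clean toponym strings, which is one way detection quality can affect disambiguation.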