Department of Biomedical Informatics, Arizona State University, Scottsdale, AZ, USA.
Biodesign Center for Environmental Health Engineering, Biodesign Institute, Arizona State University, Tempe, AZ, USA.
Bioinformatics. 2018 Jul 1;34(13):i565-i573. doi: 10.1093/bioinformatics/bty273.
Virus phylogeographers model virus spread using DNA sequences of viruses and the locations of the infected hosts, both found in public sequence databases like GenBank. However, the locations in GenBank records are often given only at the country or state level, so phylogeographers may need to scan the journal articles associated with the records to identify more localized geographic areas. To automate this process, we present a named entity recognizer (NER) for detecting locations in biomedical literature. We built the NER using a deep feedforward neural network that determines whether a given token is a toponym. To overcome the limited human-annotated data available for training, we use distant supervision techniques to generate additional training samples for our NER.
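As a rough illustration of how distant supervision can produce token-level training samples for a toponym classifier, the sketch below labels tokens by matching them against a list of place names and then builds fixed-size context windows around each token. It is a minimal sketch under stated assumptions, not the method used in this work: the gazetteer contents, the window size, and the function names `tokenize`, `distant_labels`, and `windowed_samples` are all hypothetical.

```python
import re

# Hypothetical gazetteer of lowercase place names. A real system would load
# these from an external geographic resource rather than hard-code them.
GAZETTEER = {"arizona", "tempe", "scottsdale", "hong kong"}
MAX_NAME_TOKENS = max(len(name.split()) for name in GAZETTEER)


def tokenize(text):
    """Split text into word tokens and single punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text)


def distant_labels(tokens):
    """Label each token 1 (toponym) or 0 by longest match against the gazetteer."""
    labels = [0] * len(tokens)
    lowered = [t.lower() for t in tokens]
    i = 0
    while i < len(tokens):
        matched = 0
        # Try the longest candidate span first so multi-word names win.
        for n in range(min(MAX_NAME_TOKENS, len(tokens) - i), 0, -1):
            if " ".join(lowered[i:i + n]) in GAZETTEER:
                matched = n
                break
        if matched:
            for j in range(i, i + matched):
                labels[j] = 1
            i += matched
        else:
            i += 1
    return labels


def windowed_samples(tokens, labels, size=2, pad="<PAD>"):
    """Pair each token's context window (size tokens per side) with its label."""
    padded = [pad] * size + tokens + [pad] * size
    return [(padded[i:i + 2 * size + 1], labels[i]) for i in range(len(tokens))]


tokens = tokenize("Samples were collected in Hong Kong and Tempe, Arizona.")
labels = distant_labels(tokens)
samples = windowed_samples(tokens, labels)
```

Each `(window, label)` pair could then be converted to a feature vector and fed to a feedforward classifier. Labels produced this way are noisy, since a gazetteer match does not guarantee that the token is used as a location in context.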
Our NER achieves an F1-score of 0.910 and significantly outperforms the previous state-of-the-art system. Training with the additional data generated through distant supervision raises the F1-score to 0.927. Our experiments also demonstrate the NER's capability to embed external features to further boost the system's performance. We believe that the same methodology can be applied to recognizing similar biomedical entities in scientific literature.
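For readers unfamiliar with the metric, the F1-score is the harmonic mean of precision and recall. The sketch below computes it from binary token-level labels; this is an assumption made for illustration, since the abstract does not state whether evaluation was performed per token or per entity span, and the function name `precision_recall_f1` is hypothetical.

```python
def precision_recall_f1(gold, predicted):
    """Compute precision, recall, and F1 for binary labels (1 = toponym)."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == 1 and p == 1)  # true positives
    fp = sum(1 for g, p in pairs if g == 0 and p == 1)  # false positives
    fn = sum(1 for g, p in pairs if g == 1 and p == 0)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy labels, not data from the study.
gold = [1, 1, 0, 0, 1]
predicted = [1, 0, 0, 1, 1]
precision, recall, f1 = precision_recall_f1(gold, predicted)
```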