Department of Biomedical Informatics, Arizona State University, Tempe, AZ, USA.
Center for Clinical and Translational Science, University of Vermont, Burlington, VT, USA; Department of Microbiology & Molecular Genetics, University of Vermont, Burlington, VT, USA; Department of Computer Science, University of Vermont, Burlington, VT, USA.
J Biomed Inform. 2011 Dec;44 Suppl 1(Suppl 1):S44-S47. doi: 10.1016/j.jbi.2011.06.005. Epub 2011 Jun 24.
Phylogeography is a field that focuses on the geographical lineages of species such as vertebrates or viruses. Here, geographical data, such as the location of a species or viral host, is as important as the sequence information extracted from the species. Together, this information can help illustrate the migration of the species over time within a geographical area, the impact of geography on its evolutionary history, or the expected population of the species within the area. Molecular sequence data from NCBI, specifically GenBank, provide an abundance of sequence data for phylogeography. However, geographical data is sparse and inconsistently represented across GenBank entries. This can impede analysis and, in situations where the geographical information must be inferred, potentially lead to erroneous results. In this paper, we describe the current state of geographical data in GenBank and illustrate how automated processing techniques such as named entity recognition can enhance the geographical data available for phylogeographic studies.
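To make the problem concrete, the following is a minimal sketch of how one might check a GenBank-style flat-file record for an explicit location and fall back to inference when none is present. It is not the method described in the paper: the two records are hypothetical and abbreviated, the helper names are invented for illustration, and a toy gazetteer lookup stands in for a real named entity recognition system. The only assumption about GenBank itself is that a location, when supplied, appears in a `/country` qualifier.

```python
import re

# Hypothetical, abbreviated GenBank-style records (illustration only, not real entries).
RECORD_WITH_COUNTRY = '''\
FEATURES             Location/Qualifiers
     source          1..1701
                     /organism="Influenza A virus"
                     /host="Homo sapiens"
                     /country="USA: Arizona"
'''

RECORD_WITHOUT_COUNTRY = '''\
DEFINITION  Influenza A virus (A/chicken/Vietnam/12/2005(H5N1)) segment 4.
FEATURES             Location/Qualifiers
     source          1..1701
                     /organism="Influenza A virus"
                     /strain="A/chicken/Vietnam/12/2005"
'''

# Toy gazetteer (assumption): a real system would use a far larger resource
# and a trained named entity recognizer rather than exact token matching.
GAZETTEER = {"Vietnam", "Arizona", "USA"}

COUNTRY_RE = re.compile(r'/country="([^"]+)"')


def explicit_location(record):
    """Return the value of the /country qualifier, or None if it is absent."""
    match = COUNTRY_RE.search(record)
    return match.group(1) if match else None


def inferred_locations(record):
    """Return gazetteer names found anywhere in the record text, in order, deduplicated."""
    found = []
    for token in re.findall(r"[A-Za-z]+", record):
        if token in GAZETTEER and token not in found:
            found.append(token)
    return found


def geographic_metadata(record):
    """Prefer the explicit qualifier; otherwise fall back to inferred place names."""
    location = explicit_location(record)
    if location is not None:
        return {"source": "qualifier", "locations": [location]}
    return {"source": "inferred", "locations": inferred_locations(record)}


if __name__ == "__main__":
    print(geographic_metadata(RECORD_WITH_COUNTRY))
    print(geographic_metadata(RECORD_WITHOUT_COUNTRY))
```

Keeping the `source` field separate from the location lets downstream analysis treat inferred locations with more caution than explicitly annotated ones, which is the risk the abstract raises.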