Enhanced geocoding precision for location inference of tweet text using spaCy, Nominatim and Google Maps. A comparative analysis of the influence of data selection.

Affiliations

Department of Digital and Analytical Sciences, University of Salzburg, Salzburg, Austria.

Centre for Geographic Analysis, Harvard University, Cambridge, MA, United States of America.

Publication information

PLoS One. 2023 Mar 15;18(3):e0282942. doi: 10.1371/journal.pone.0282942. eCollection 2023.

Abstract

Twitter location inference methods are developed to increase the percentage of geotagged tweets by inferring locations for a non-geotagged dataset. To validate a proposed approach, such methods are developed on a fully geotagged dataset, where the attached Global Navigation Satellite System coordinates serve as ground truth. Whilst a substantial number of location inference methods have been developed to date, questions remain about how well the resulting models generalize to a non-geotagged dataset. This paper proposes a high-precision location inference method that infers a tweet's point of origin from location mentions within the tweet text. We investigate the influence of data selection by comparing model performance on two datasets. For the first dataset, we use a proportionate sample of the tweet sources of a geotagged dataset. For the second dataset, we use a modelled distribution of tweet sources following a non-geotagged dataset. Our results show that the distribution of tweet sources influences the performance of location inference models. Using the first dataset, we outperformed state-of-the-art location extraction models by inferring 61.9%, 86.1% and 92.1% of the extracted locations within radii of 1 km, 10 km and 50 km, respectively. Using the second dataset, however, our precision values dropped to 45.3%, 73.1% and 81.0% for the same radii.
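The radius-based precision values reported above can be illustrated with a minimal sketch. It assumes that each inferred location is compared with the tweet's ground-truth coordinates by great-circle (haversine) distance and that precision is the share of inferred locations falling within each radius; the abstract does not state which distance measure the authors used. The function names haversine_km and share_within_radius and the sample coordinates are hypothetical and not taken from the paper.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius; an assumption of this sketch


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def share_within_radius(pairs, radii_km=(1, 10, 50)):
    """Fraction of (inferred, truth) coordinate pairs lying within each radius.

    `pairs` is a list of ((lat, lon), (lat, lon)) tuples: the inferred
    location first, the ground-truth location second.
    """
    if not pairs:
        raise ValueError("need at least one (inferred, truth) pair")
    distances = [haversine_km(*inferred, *truth) for inferred, truth in pairs]
    return {r: sum(d <= r for d in distances) / len(distances)
            for r in radii_km}


# Illustrative, made-up pairs (not data from the paper).
sample_pairs = [
    ((47.8095, 13.0550), (47.8095, 13.0550)),  # exact match
    ((47.8595, 13.0550), (47.8095, 13.0550)),  # about 5.6 km apart
    ((48.1095, 13.0550), (47.8095, 13.0550)),  # about 33 km apart
    ((48.2082, 16.3738), (47.8095, 13.0550)),  # roughly 250 km apart
]
print(share_within_radius(sample_pairs))
```

Because each radius includes all smaller ones, the resulting shares are cumulative, which matches the increasing pattern of the reported figures (for example 61.9%, 86.1% and 92.1%).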

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/651b/10016707/4ae62077851f/pone.0282942.g001.jpg
