Grieve Jack, Montgomery Chris, Nini Andrea, Murakami Akira, Guo Diansheng
Department of English Language and Linguistics, University of Birmingham, Birmingham, United Kingdom.
School of English, University of Sheffield, Sheffield, United Kingdom.
Front Artif Intell. 2019 Jul 12;2:11. doi: 10.3389/frai.2019.00011. eCollection 2019.
There is a growing trend in regional dialectology to analyse large corpora of social media data, but it is unclear if the results of these studies can be generalized to language as a whole. To assess the generalizability of Twitter dialect maps, this paper presents the first systematic comparison of regional lexical variation in Twitter corpora and traditional survey data. We compare the regional patterns found in 139 lexical dialect maps based on a 1.8 billion word corpus of geolocated UK Twitter data and the BBC Voices dialect survey. A spatial analysis of these 139 map pairs finds a broad alignment between these two data sources, offering evidence that both approaches to data collection allow for the same basic underlying regional patterns to be identified. We argue that these results license the use of Twitter corpora for general inquiries into regional lexical variation and change.