Regionalized models for Spanish language variations based on Twitter.

Authors

Tellez Eric S, Moctezuma Daniela, Miranda Sabino, Graff Mario, Ruiz Guillermo

Affiliations

Conacyt, Consejo Nacional de Ciencia y Tecnología, Av. Insurgentes Sur 1582, Col. Crédito Constructor, 03940 CDMX, Mexico.

INFOTEC, Centro de Investigación e Innovación en Tecnologías de la Información y Comunicación, Circuito Tecnopolo Norte, No. 112, Col. Tecnopolo Pocitos II, 20326 Aguascalientes, Aguascalientes, Mexico.

Publication information

Lang Resour Eval. 2023 Mar 2:1-31. doi: 10.1007/s10579-023-09640-9.

Abstract

Spanish is one of the most widely spoken languages in the world. Its spread comes with variations in written and spoken communication across regions. Understanding these variations can help improve model performance on regional tasks, such as those involving figurative language and local context information. This manuscript presents and describes a set of regionalized resources for the Spanish language built from four years of public Twitter messages geotagged in 26 Spanish-speaking countries. We introduce word embeddings based on FastText, language models based on BERT, and per-region sample corpora. We also provide a broad comparison among regions covering lexical and semantic similarities, along with examples of using the regional resources on message classification tasks.
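
To make the idea of a lexical comparison between regions concrete, the sketch below scores the overlap of two regional vocabularies with Jaccard similarity. It is only an illustration under stated assumptions: the abstract does not say which similarity measure the authors use, the function names `top_k_vocab` and `jaccard` are hypothetical, and the sample messages are made up for the example rather than taken from the published corpora.

```python
from collections import Counter


def top_k_vocab(messages, k):
    """Return the set of the k most frequent lowercase whitespace tokens."""
    counts = Counter(tok for msg in messages for tok in msg.lower().split())
    return {tok for tok, _ in counts.most_common(k)}


def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b| between two token sets."""
    if not a and not b:
        return 1.0  # two empty vocabularies are treated as identical
    return len(a & b) / len(a | b)


# Made-up toy messages (NOT from the paper's corpora), keyed by country code.
samples = {
    "MX": ["qué padre está el carro", "vamos por unos tacos"],
    "AR": ["qué copado está el auto", "vamos por unas empanadas"],
}

# k is an arbitrary illustrative cutoff; here it is larger than either vocabulary.
vocab = {region: top_k_vocab(msgs, k=50) for region, msgs in samples.items()}
similarity = jaccard(vocab["MX"], vocab["AR"])
print(f"MX vs AR lexical overlap: {similarity:.3f}")
```

A set-based measure like this only captures shared word forms. Comparing meaning across regions would instead need the regional embeddings or language models, which this sketch does not touch.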

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/1f25/9979884/436b9c8ef4df/10579_2023_9640_Fig1_HTML.jpg

Similar articles

6
Crowdsourcing dialect characterization through Twitter.
PLoS One. 2014 Nov 19;9(11):e112074. doi: 10.1371/journal.pone.0112074. eCollection 2014.
7
Language modality shapes the dynamics of word and sign recognition.
Cognition. 2019 Oct;191:103979. doi: 10.1016/j.cognition.2019.05.016. Epub 2019 Jun 21.
10
Improve word embedding using both writing and pronunciation.
PLoS One. 2018 Dec 10;13(12):e0208785. doi: 10.1371/journal.pone.0208785. eCollection 2018.
