

Using lexical language models to detect borrowings in monolingual wordlists.

Affiliations

Artificial Intelligence/Engineering, Pontificia Universidad Católica del Perú, San Miguel, Lima, Peru.

Department of Linguistic and Cultural Evolution, Max Planck Institute for the Science of Human History, Jena, Germany.

Publication information

PLoS One. 2020 Dec 9;15(12):e0242709. doi: 10.1371/journal.pone.0242709. eCollection 2020.

Abstract

Lexical borrowing, the transfer of words from one language to another, is one of the most frequent processes in language evolution. To detect borrowings, linguists use a range of strategies that combine evidence from multiple sources. Despite the increasing popularity of computational approaches in comparative linguistics, automated approaches to lexical borrowing detection are still in their infancy and disregard many aspects of the evidence that human experts routinely consider. One example of such evidence is phonological and phonotactic clues, which are especially useful for detecting recent borrowings that have not yet been adapted to the structure of their recipient languages. In this study, we test how these clues can be exploited in automated frameworks for borrowing detection. By modeling phonology and phonotactics with support vector machines, Markov models, and recurrent neural networks, we propose a framework for the supervised detection of borrowings in monolingual wordlists. Based on a substantially revised dataset in which lexical borrowings have been thoroughly annotated for 41 typologically diverse languages from different families, we use these models to conduct a series of experiments investigating their performance in monolingual borrowing detection. While the general results appear largely unsatisfying at first glance, further tests show that the performance of our models improves with increasing numbers of attested borrowings and in cases where most borrowings were introduced by a single donor language. Our results show that phonological and phonotactic clues derived from monolingual language data are often not sufficient to detect borrowings when used in isolation. Based on our detailed findings, however, we express hope that they could prove useful in integrated approaches that take multilingual information into account.
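
The Markov-model idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a first-order (bigram) model over characters, add-one smoothing, and a simple rule that flags a word as borrowed when a model trained on known borrowings assigns it a higher probability than a model trained on inherited words. The word forms, the boundary symbol, and all function names are invented for illustration.

```python
import math
from collections import Counter

BOUNDARY = "#"  # assumed word-boundary symbol, not taken from the paper


def pad(word):
    """Split a word into segments (here: single characters) and add boundary marks."""
    return [BOUNDARY] + list(word) + [BOUNDARY]


def train_bigram_model(words):
    """Count segment bigrams and their left contexts over a wordlist."""
    bigrams, contexts = Counter(), Counter()
    for word in words:
        segments = pad(word)
        for prev, cur in zip(segments, segments[1:]):
            bigrams[(prev, cur)] += 1
            contexts[prev] += 1
    return bigrams, contexts


def log_prob(model, word, vocab_size):
    """Add-one smoothed log-probability of a word under a bigram model."""
    bigrams, contexts = model
    segments = pad(word)
    return sum(
        math.log((bigrams[(prev, cur)] + 1) / (contexts[prev] + vocab_size))
        for prev, cur in zip(segments, segments[1:])
    )


def is_borrowed(word, inherited_model, borrowed_model, vocab_size):
    """Flag a word as borrowed if the borrowed-word model scores it higher."""
    return (log_prob(borrowed_model, word, vocab_size)
            > log_prob(inherited_model, word, vocab_size))


# Invented toy forms, NOT drawn from the paper's dataset.
inherited_words = ["kata", "taka", "kaka", "tata"]
borrowed_words = ["strim", "sprit", "stript"]

# Shared segment inventory so both models are smoothed over the same space.
vocab_size = len({s for w in inherited_words + borrowed_words for s in pad(w)})

inherited_model = train_bigram_model(inherited_words)
borrowed_model = train_bigram_model(borrowed_words)

for candidate in ["strit", "kataka"]:
    print(candidate, is_borrowed(candidate, inherited_model, borrowed_model, vocab_size))
```

In this toy setup, consonant clusters that occur only in the borrowed training words push a form toward the borrowed class. With real wordlists the two classes overlap far more, which is consistent with the abstract's report that monolingual phonotactic clues are often insufficient on their own.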


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/abba/7725347/b418863d254b/pone.0242709.g001.jpg
