Blum Frederic, Barrientos Carlos, Englisch Johannes, Forkel Robert, Greenhill Simon J, Rzymski Christoph, List Johann-Mattis
Department of Linguistic and Cultural Evolution, Max-Planck-Institute for Evolutionary Anthropology, Leipzig, Saxony, 04103, Germany.
Chair for Multilingual Computational Linguistics, Universität Passau, Passau, Bavaria, Germany.
Open Res Eur. 2025 May 9;5:126. doi: 10.12688/openreseurope.20216.1. eCollection 2025.
Large-scale lexical and grammatical datasets now play an important role in comparative linguistics. However, the lack of standardization remains a challenge that hinders the extension and reuse of published data. We present an updated version of Lexibank, a large-scale lexical dataset, expanding on previous efforts to standardize and unify cross-linguistic data. This new version includes over 3,100 languages and more than one-and-a-half million word forms, substantially broadening the scope and utility of the previous resource. Our dataset has been systematically curated using a dedicated computer-assisted workflow designed specifically for lifting published wordlist data to the standards recommended by the Cross-Linguistic Data Formats initiative. The expanded dataset features standardized references to language varieties, standardized semantic glosses that reference the concepts expressed by individual word forms, and standardized phonetic transcriptions for all word forms in the repository. Based on these standardizations, we pre-compute semantic and phonological features, which can be used to carry out extensive automated analyses. We illustrate this potential by providing dedicated database queries to (1) infer words that are similar in pronunciation and meaning, (2) identify concepts that are colexified across languages in our sample, and (3) assess the semantic diversity of etymologically related words. These queries are not only fast to execute but also global in scope, owing to the large-scale coverage provided by Lexibank 2. The queries are also easy to extend and can thus contribute to various studies in historical linguistics, linguistic typology, and related disciplines. The updated dataset is a substantial step forward in the effort to create comprehensive, standardized, and accessible linguistic resources.
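To make the second kind of query more concrete, the following is a minimal sketch of how colexified concepts could be counted with SQL. It is not one of the queries shipped with Lexibank: the table name `forms`, its columns, the function `colexifications`, and all rows are hypothetical placeholders chosen for illustration, and two concepts are treated as colexified in a language when they share an identical segmented form there.

```python
import sqlite3

# Invented toy rows (language, concept, segmented form).
# Hypothetical schema; not the actual Lexibank / CLDF SQLite layout.
ROWS = [
    ("lang_a", "ARM", "r u k a"),
    ("lang_a", "HAND", "r u k a"),
    ("lang_a", "TREE", "d e r e v o"),
    ("lang_b", "ARM", "b r a s"),
    ("lang_b", "HAND", "m a n"),
    ("lang_b", "TREE", "a r b o l"),
    ("lang_b", "WOOD", "a r b o l"),
    ("lang_c", "ARM", "t e"),
    ("lang_c", "HAND", "t e"),
]


def colexifications(rows):
    """Return (concept_1, concept_2, n_languages) for each colexified pair."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE forms (language TEXT, concept TEXT, segments TEXT)"
    )
    con.executemany("INSERT INTO forms VALUES (?, ?, ?)", rows)
    # Self-join: same language, same form, different concepts.
    # `a.concept < b.concept` keeps each unordered pair exactly once.
    query = """
        SELECT a.concept, b.concept, COUNT(DISTINCT a.language) AS n_languages
        FROM forms AS a
        JOIN forms AS b
          ON a.language = b.language
         AND a.segments = b.segments
         AND a.concept < b.concept
        GROUP BY a.concept, b.concept
        ORDER BY n_languages DESC, a.concept, b.concept
    """
    result = con.execute(query).fetchall()
    con.close()
    return result


if __name__ == "__main__":
    for concept_1, concept_2, n in colexifications(ROWS):
        print(concept_1, concept_2, n)
```

Because every form is linked to a standardized concept and a standardized transcription, a single self-join of this kind can in principle run over all languages at once, which is the property the abstract points to when it describes the queries as global in scope.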