Lexibank 2: pre-computed features for large-scale lexical data.

Author information

Blum Frederic, Barrientos Carlos, Englisch Johannes, Forkel Robert, Greenhill Simon J, Rzymski Christoph, List Johann-Mattis

Affiliations

Department of Linguistic and Cultural Evolution, Max-Planck-Institute for Evolutionary Anthropology, Leipzig, Saxony, 04103, Germany.

Chair for Multilingual Computational Linguistics, Universität Passau, Passau, Bavaria, Germany.

Publication information

Open Res Eur. 2025 May 9;5:126. doi: 10.12688/openreseurope.20216.1. eCollection 2025.

Abstract

Large-scale lexical and grammatical datasets now play an important role in comparative linguistics. However, the lack of standardization remains a challenge that hinders the extension and reuse of published data. We present an updated version of Lexibank, a large-scale lexical dataset, expanding on previous efforts to standardize and unify cross-linguistic data. This new version includes over 3,100 languages and more than one-and-a-half million word forms, substantially broadening the scope and utility of the previous resource. Our dataset has been systematically curated using a dedicated computer-assisted workflow designed specifically for lifting published wordlist data to the standards recommended by the Cross-Linguistic Data Formats initiative. The expanded dataset features standardized references to language varieties, standardized semantic glosses that reference the concepts expressed by individual word forms, and standardized phonetic transcriptions for all word forms that our repository contains. Based on these standardizations, we pre-compute semantic and phonological features, which can be used to carry out extensive automated analyses. We illustrate this potential by providing dedicated database queries to (1) infer words that are similar in pronunciation and meaning, (2) identify concepts that are colexified across languages in our sample, and (3) assess the semantic diversity of etymologically related words. These queries are not only fast to execute but also global in their scope, due to the large-scale coverage provided by Lexibank 2. The queries are also easy to extend and thus have the potential to contribute to various studies in historical linguistics, linguistic typology, and related disciplines. The updated dataset is a substantial step forward in the effort to create comprehensive, standardized, and accessible linguistic resources.
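
To make the second kind of query concrete, the following is a minimal sketch of how colexified concepts could be found with a database query. It is not the authors' query and does not use Lexibank's actual schema: the table `forms`, its columns, the toy data, and the function name `colexifications` are all hypothetical, chosen for illustration. Two concepts count as colexified in a language when the same segmented word form expresses both.

```python
import sqlite3

# Toy wordlist of (language, concept, segmented form).
# Invented for illustration only; this is NOT Lexibank data.
TOY_FORMS = [
    ("LangA", "ARM", "m a n"),
    ("LangA", "HAND", "m a n"),
    ("LangA", "TREE", "k a y"),
    ("LangB", "ARM", "r u k"),
    ("LangB", "HAND", "r u k"),
    ("LangB", "TREE", "d o r"),
    ("LangB", "WOOD", "d o r"),
    ("LangC", "ARM", "b r a"),
    ("LangC", "HAND", "m e n"),
]


def colexifications(rows):
    """Return (concept_a, concept_b, n_languages) for each colexified pair."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical schema: one row per word form.
    conn.execute(
        "CREATE TABLE forms (language TEXT, concept TEXT, segments TEXT)"
    )
    conn.executemany("INSERT INTO forms VALUES (?, ?, ?)", rows)
    # Self-join: same language, same form, two different concepts.
    # The condition a.concept < b.concept keeps each unordered pair once.
    query = """
        SELECT a.concept, b.concept, COUNT(DISTINCT a.language) AS n_languages
        FROM forms AS a
        JOIN forms AS b
          ON a.language = b.language
         AND a.segments = b.segments
         AND a.concept < b.concept
        GROUP BY a.concept, b.concept
        ORDER BY n_languages DESC, a.concept, b.concept
    """
    result = conn.execute(query).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for concept_a, concept_b, n in colexifications(TOY_FORMS):
        print(f"{concept_a} / {concept_b}: colexified in {n} language(s)")
```

In this toy data, ARM and HAND share a form in two languages and TREE and WOOD in one. Counting distinct languages per concept pair is one simple way to rank colexifications; a real analysis would likely also control for language family to avoid counting inherited patterns repeatedly.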

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e92f/12186020/eb4474f94047/openreseurope-5-22414-g0000.jpg
