Lexibank 2: pre-computed features for large-scale lexical data.

Authors

Blum Frederic, Barrientos Carlos, Englisch Johannes, Forkel Robert, Greenhill Simon J, Rzymski Christoph, List Johann-Mattis

Affiliations

Department of Linguistic and Cultural Evolution, Max-Planck-Institute for Evolutionary Anthropology, Leipzig, Saxony, 04103, Germany.

Chair for Multilingual Computational Linguistics, Universität Passau, Passau, Bavaria, Germany.

Publication information

Open Res Eur. 2025 May 9;5:126. doi: 10.12688/openreseurope.20216.1. eCollection 2025.

DOI:10.12688/openreseurope.20216.1
PMID:40469274
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12134731/
Abstract

Large-scale lexical and grammatical datasets now play an important role in comparative linguistics. However, the lack of standardization remains a challenge that hinders the extension and reuse of published data. We present an updated version of Lexibank, a large-scale lexical dataset, expanding on previous efforts to standardize and unify cross-linguistic data. This new version includes over 3,100 languages and more than 1.5 million word forms, substantially broadening the scope and utility of the previous resource. Our dataset has been systematically curated using a dedicated computer-assisted workflow designed specifically for lifting published wordlist data to the standards recommended by the Cross-Linguistic Data Formats initiative. The expanded dataset features standardized references to language varieties, standardized semantic glosses that reference the concepts expressed by individual word forms, and standardized phonetic transcriptions for all word forms that our repository contains. Based on those standardizations, we pre-compute semantic and phonological features, which can be used to carry out extensive automated analyses. We illustrate this potential by providing dedicated database queries to (1) infer words that are similar in pronunciation and meaning, (2) identify concepts that are colexified across languages in our sample, and (3) assess the semantic diversity of etymologically related words. These queries are not only fast to execute but also global in their scope, due to the large-scale coverage provided by Lexibank 2. The queries are also easy to extend, and thus have the potential to contribute to various studies in historical linguistics, linguistic typology, and related disciplines. The updated dataset is a substantial step forward in the effort to create comprehensive, standardized, and accessible linguistic resources.
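To make the second kind of query more concrete, the following is a minimal sketch of how colexified concept pairs could be counted with a plain SQL self-join. It is not the authors' query and does not use the Lexibank schema: the table `forms`, its columns `language`, `concept`, and `form`, the function name `colexifications`, and the toy rows are all assumptions made for illustration. Two concepts are treated as colexified in a language when that language has the same form for both.

```python
import sqlite3

# Toy wordlist rows: (language, concept, form).
# Invented for illustration; not taken from Lexibank.
ROWS = [
    ("lang_a", "ARM", "ruka"),
    ("lang_a", "HAND", "ruka"),
    ("lang_a", "TREE", "derevo"),
    ("lang_b", "ARM", "te"),
    ("lang_b", "HAND", "te"),
    ("lang_b", "TREE", "ki"),
    ("lang_b", "WOOD", "ki"),
    ("lang_c", "ARM", "arm"),
    ("lang_c", "HAND", "hand"),
    ("lang_c", "TREE", "tree"),
    ("lang_c", "WOOD", "wood"),
]

# Self-join on language and form; `a.concept < b.concept` keeps each
# unordered concept pair once and drops a concept paired with itself.
QUERY = """
SELECT a.concept, b.concept, COUNT(DISTINCT a.language) AS n_languages
FROM forms AS a
JOIN forms AS b
  ON a.language = b.language
 AND a.form = b.form
 AND a.concept < b.concept
GROUP BY a.concept, b.concept
ORDER BY n_languages DESC, a.concept, b.concept
"""


def colexifications(rows):
    """Return (concept_1, concept_2, n_languages) for colexified pairs."""
    con = sqlite3.connect(":memory:")
    # Hypothetical table layout, assumed for this sketch only.
    con.execute("CREATE TABLE forms (language TEXT, concept TEXT, form TEXT)")
    con.executemany("INSERT INTO forms VALUES (?, ?, ?)", rows)
    result = con.execute(QUERY).fetchall()
    con.close()
    return result


if __name__ == "__main__":
    for concept_1, concept_2, n_languages in colexifications(ROWS):
        print(concept_1, concept_2, n_languages)
```

On the toy rows, ARM and HAND share a form in two languages and TREE and WOOD in one. A real analysis would compare standardized transcriptions and concept identifiers rather than raw strings, which is the role the abstract assigns to the dataset's standardizations.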

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e92f/12186020/d4ecc36fbb2c/openreseurope-5-22414-g0001.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e92f/12186020/eb4474f94047/openreseurope-5-22414-g0000.jpg
