Department of Computer Science, University of Pisa, Largo Bruno Pontecorvo 3, Pisa, Italy.
Institute of Information Science and Technologies "A. Faedo" (ISTI), National Research Council (CNR), G. Moruzzi 1, Pisa, Italy.
Sci Rep. 2023 Jan 26;13(1):1474. doi: 10.1038/s41598-022-27029-6.
Knowledge in the human mind exhibits a dualistic vector/network nature. Modelling words as vectors is key to natural language processing, whereas networks of word associations can map the nature of semantic memory. We reconcile these paradigms, which are fragmented across linguistics, psychology and computer science, by introducing FEature-Rich MUltiplex LEXical (FERMULEX) networks. This novel framework merges structural similarities in networks and vector features of words, which can be combined or explored independently. Similarities model heterogeneous word associations across semantic/syntactic/phonological aspects of knowledge. Words are enriched with multi-dimensional feature embeddings including frequency, age of acquisition, length and polysemy. These aspects enable unprecedented explorations of cognitive knowledge. Using CHILDES data, we apply FERMULEX networks to model normative language acquisition by 1000 toddlers between 18 and 30 months. Similarities and embeddings capture word homophily via conformity, which measures assortative mixing via distance and features. Conformity unearths a language kernel of frequent/polysemous/short nouns and verbs key for basic sentence production, supporting recent evidence of children's syntactic constructs emerging at 30 months. This kernel is invisible to network core-detection and feature-only clustering: it emerges from the dual vector/network nature of words. Our quantitative analysis reveals two key strategies in early word learning. Modelling word acquisition as random walks on FERMULEX topology, we highlight non-uniform filling of communicative developmental inventories (CDIs). Biased random walkers lead to accurate (75%), precise (55%) and partially well-recalled (34%) predictions of early word learning in CDIs, providing quantitative support to previous empirical findings and developmental theories.
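To make the random-walk idea concrete, the following is a minimal sketch and not the authors' implementation. It assumes a toy word network and a single hypothetical feature (relative frequency) that biases each step of the walker; the names `EDGES`, `FREQ`, `biased_walk` and `score` and all values are illustrative and do not come from CHILDES. The words visited by the walker are treated as predicted "learned" words and compared with a reference set through accuracy, precision and recall. The bias and evaluation protocol used in the study may differ.

```python
import random

# Toy word network (hypothetical data, not taken from CHILDES).
EDGES = {
    "dog": ["cat", "ball", "go"],
    "cat": ["dog", "hat"],
    "ball": ["dog", "go"],
    "go": ["dog", "ball", "hat"],
    "hat": ["cat", "go"],
}

# Hypothetical per-word feature: relative frequency in child-directed speech.
FREQ = {"dog": 0.9, "cat": 0.7, "ball": 0.6, "go": 0.95, "hat": 0.2}


def biased_walk(start, steps, rng):
    """Walk the network; each step picks a neighbour with probability
    proportional to its frequency (an assumed bias). Returns visited words."""
    visited = {start}
    current = start
    for _ in range(steps):
        neighbours = EDGES[current]
        weights = [FREQ[w] for w in neighbours]
        current = rng.choices(neighbours, weights=weights, k=1)[0]
        visited.add(current)
    return visited


def score(predicted, learned, vocabulary):
    """Accuracy, precision and recall of predicted vs. actually learned words."""
    tp = len(predicted & learned)               # predicted and learned
    fp = len(predicted - learned)               # predicted, not learned
    fn = len(learned - predicted)               # learned, not predicted
    tn = len(vocabulary - predicted - learned)  # neither
    accuracy = (tp + tn) / len(vocabulary)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall


if __name__ == "__main__":
    rng = random.Random(0)
    predicted = biased_walk("dog", steps=3, rng=rng)
    learned = {"dog", "go", "ball"}  # hypothetical reference set
    print(predicted, score(predicted, learned, set(EDGES)))
```

Replacing the frequency weights with uniform weights gives an unbiased walker, which is a natural baseline for comparison under the same scoring function.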