Department of Computer Science, Cornell University, Ithaca, NY, 14850, USA.
Department of Linguistics, Massachusetts Institute of Technology, Cambridge, MA, 02139, USA.
Nat Commun. 2022 Aug 30;13(1):5024. doi: 10.1038/s41467-022-32012-w.
Automated, data-driven construction and evaluation of scientific models and theories is a long-standing challenge in artificial intelligence. We present a framework for algorithmically synthesizing models of a basic part of human language: morpho-phonology, the system that builds word forms from sounds. We integrate Bayesian inference with program synthesis and representations inspired by linguistic theory and cognitive models of learning and discovery. Across 70 datasets from 58 diverse languages, our system synthesizes human-interpretable models for core aspects of each language's morpho-phonology, sometimes approaching models posited by human linguists. Joint inference across all 70 datasets automatically synthesizes a meta-model encoding interpretable cross-language typological tendencies. Finally, the same algorithm captures few-shot learning dynamics, acquiring new morpho-phonological rules from just one or a few examples. These results suggest routes to more powerful machine-enabled discovery of interpretable models in linguistics and other scientific domains.