The natural selection of words: Finding the features of fitness.

Affiliations

Ronin Institute, Montclair, New Jersey, United States of America.

National Research Council Canada, Ottawa, Ontario, Canada.

Publication information

PLoS One. 2019 Jan 28;14(1):e0211512. doi: 10.1371/journal.pone.0211512. eCollection 2019.

Abstract

We introduce a dataset for studying the evolution of words, constructed from WordNet and the Google Books Ngram Corpus. The dataset tracks the evolution of 4,000 synonym sets (synsets), containing 9,000 English words, from 1800 AD to 2000 AD. We present a supervised learning algorithm that is able to predict the future leader of a synset: the word in the synset that will have the highest frequency. The algorithm uses features based on a word's length, the characters in the word, and the historical frequencies of the word. It can predict change of leadership (including the identity of the new leader) fifty years in the future, with an F-score considerably above random guessing. Analysis of the learned models provides insight into the causes of change in the leader of a synset. The algorithm confirms observations linguists have made, such as the trend to replace the -ise suffix with -ize, the rivalry between the -ity and -ness suffixes, and the struggle between economy (shorter words are easier to remember and to write) and clarity (longer words are more distinctive and less likely to be confused with one another). The results indicate that integration of the Google Books Ngram Corpus with WordNet has significant potential for improving our understanding of how language evolves.
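
The abstract names three feature families (word length, characters in the word, historical frequency) and a prediction target (the future leader of a synset) but does not give exact definitions. The following is a minimal sketch under stated assumptions: the toy frequencies, the `suffix3` feature, the `share` normalization, and all function names are hypothetical and are not taken from the paper or its dataset.

```python
from typing import Dict

# Hypothetical toy synset: word -> {year: relative frequency}.
# The numbers are invented for illustration; they are not from the dataset.
Synset = Dict[str, Dict[int, float]]

TOY_SYNSET: Synset = {
    "realise": {1900: 0.60, 1950: 0.45},
    "realize": {1900: 0.40, 1950: 0.55},
}


def leader(synset: Synset, year: int) -> str:
    """Return the word with the highest frequency in `year`."""
    return max(synset, key=lambda word: synset[word][year])


def word_features(synset: Synset, word: str, year: int) -> dict:
    """Build illustrative features from length, characters, and frequency."""
    total = sum(freqs[year] for freqs in synset.values())
    return {
        "length": len(word),
        "suffix3": word[-3:],  # crude stand-in for character-based features
        "share": synset[word][year] / total if total else 0.0,
        "is_leader": word == leader(synset, year),
    }


def leadership_changed(synset: Synset, start: int, end: int) -> bool:
    """Target label: does the leader at `end` differ from the leader at `start`?"""
    return leader(synset, start) != leader(synset, end)


if __name__ == "__main__":
    for w in TOY_SYNSET:
        print(w, word_features(TOY_SYNSET, w, 1900))
    print("changed:", leadership_changed(TOY_SYNSET, 1900, 1950))
```

In a supervised setup, feature dictionaries like these would be computed at a starting year and paired with the leader observed fifty years later as the training label. Which learning algorithm consumes them is left open here, since the abstract does not specify it.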

