Laboratorio de Inteligencia Artificial Aplicada, Instituto de Ciencias de la Computación, Facultad de Ciencias Exactas y Naturales, Universidad de Buenos Aires - Consejo Nacional de Investigación en Ciencia y Técnica, Ciudad Autónoma de Buenos Aires, Argentina.
Departamento de Computación, Facultad de Ciencias Exactas y Naturales, Universidad de Buenos Aires, Ciudad Autónoma de Buenos Aires, Argentina.
Sci Rep. 2020 Mar 10;10(1):4396. doi: 10.1038/s41598-020-61353-z.
When we read printed text, we continuously predict upcoming words to integrate information and guide future eye movements. Thus, the Predictability of a given word has become one of the most important variables for explaining human behaviour and information processing during reading. In parallel, the Natural Language Processing (NLP) field has evolved by developing a wide variety of applications. Here, we show that, using different word-embedding techniques (such as Latent Semantic Analysis, Word2Vec, and FastText) and N-gram-based language models, we were able to estimate how humans predict words (cloze-task Predictability) and to better understand eye movements in long Spanish texts. Both types of models partially captured aspects of Predictability. On the one hand, our N-gram model performed well when added as a replacement for the cloze-task Predictability of the fixated word. On the other hand, word embeddings were useful for mimicking the Predictability of the following word. Our study joins efforts from the neurolinguistics and NLP fields to understand human information processing during reading and, potentially, to improve NLP algorithms.
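To make the N-gram idea concrete, here is a minimal sketch of how a language model can stand in for cloze-task Predictability. It is not the authors' model: it assumes a bigram model with add-one (Laplace) smoothing, and the toy Spanish corpus and the names `train_bigram` and `bigram_probability` are hypothetical, chosen only for illustration. The estimated probability of a word given the preceding word plays the role that the proportion of cloze-task participants guessing that word would play.

```python
from collections import Counter


def train_bigram(sentences):
    """Count contexts and bigrams over a list of tokenized sentences."""
    context_counts = Counter()   # how often each word appears as a context
    bigram_counts = {}           # context word -> Counter of following words
    vocab = set()
    for tokens in sentences:
        padded = ["<s>"] + tokens  # "<s>" marks the sentence start
        vocab.update(tokens)
        for prev, word in zip(padded, padded[1:]):
            context_counts[prev] += 1
            bigram_counts.setdefault(prev, Counter())[word] += 1
    return context_counts, bigram_counts, vocab


def bigram_probability(prev, word, context_counts, bigram_counts, vocab):
    """Estimate P(word | prev) with add-one smoothing (an assumption here)."""
    seen = bigram_counts.get(prev, Counter())[word]
    return (seen + 1) / (context_counts[prev] + len(vocab))


# Hypothetical toy corpus; a real study would use a large Spanish corpus.
corpus = [
    ["el", "perro", "come", "carne"],
    ["el", "perro", "come", "pan"],
    ["el", "gato", "come", "pescado"],
]

context_counts, bigram_counts, vocab = train_bigram(corpus)

# Model-based predictability of candidate words following "el".
p_perro = bigram_probability("el", "perro", context_counts, bigram_counts, vocab)
p_gato = bigram_probability("el", "gato", context_counts, bigram_counts, vocab)
p_pan = bigram_probability("el", "pan", context_counts, bigram_counts, vocab)

print(f"P(perro | el) = {p_perro:.2f}")
print(f"P(gato  | el) = {p_gato:.2f}")
print(f"P(pan   | el) = {p_pan:.2f}")
```

In this sketch "perro" is the most predictable continuation of "el" because it follows "el" most often in the toy corpus, while the unseen continuation "pan" still receives a small non-zero probability from the smoothing. Longer contexts (trigrams and beyond) and other smoothing schemes follow the same pattern.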