Leal Sidney Evaldo, Lukasova Katerina, Carthery-Goulart Maria Teresa, Aluísio Sandra Maria
Instituto de Ciências Matemáticas e de Computação - University of São Paulo, São Paulo, Brazil.
Center of Mathematics, Computing and Cognition, Federal University of ABC, São Paulo, Brazil.
Lang Resour Eval. 2022;56(4):1333-1372. doi: 10.1007/s10579-022-09609-0. Epub 2022 Aug 17.
This article presents RastrOS, a new eye-tracking corpus of eye movement data recorded from university students during silent reading of paragraphs in Brazilian Portuguese (BP). It demonstrates the corpus's potential for natural language processing (NLP) by using it to evaluate the sentence complexity prediction task in BP, and it describes the NLP resources and methods developed to create the corpus. Specifically, we present: (i) the method used to select the corpus paragraphs from large corpora, based on linguistic metrics and clustering algorithms; (ii) the platform used to collect Cloze test responses, which also generates the project datasets; and (iii) the hybrid semantic similarity method, based on word embedding models and contextualised word representations, used to generate semantic predictability norms. RastrOS, together with the computational infrastructure described above, can be downloaded from the Open Science Framework (OSF) repository for this work (https://osf.io/9jxg3/), which provides datasets with predictability norms from 393 participants and eye-tracking data from 37 participants.
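To illustrate the idea behind item (iii), the following is a minimal sketch of an embedding-based semantic predictability score: the mean cosine similarity between the target word of a Cloze item and the words participants supplied. It is an assumption made for illustration, not the hybrid method of the article, whose details the abstract does not give. The function names, the averaging rule, and the two-dimensional toy vectors are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def semantic_predictability(target, responses, embeddings):
    """Mean cosine similarity between the target word and each Cloze response.

    Assumption: responses missing from the embedding table are skipped,
    and an item with no usable responses scores 0.0.
    """
    sims = [
        cosine(embeddings[target], embeddings[response])
        for response in responses
        if response in embeddings
    ]
    return sum(sims) / len(sims) if sims else 0.0


# Hypothetical toy embeddings; a real setup would load a trained BP model.
toy_embeddings = {
    "casa": [1.0, 0.0],
    "lar": [0.6, 0.8],
    "carro": [0.0, 1.0],
}

# Three hypothetical Cloze responses for an item whose target word is "casa".
score = semantic_predictability("casa", ["casa", "lar", "carro"], toy_embeddings)
print(round(score, 3))
```

Under this sketch, an exact match contributes a similarity of 1, a near-synonym contributes partial credit, and an unrelated word contributes close to 0, so the score rewards responses that are semantically close to the target even when they are not identical to it.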