RastrOS Project: Natural Language Processing contributions to the development of an eye-tracking corpus with predictability norms for Brazilian Portuguese.

Authors

Leal Sidney Evaldo, Lukasova Katerina, Carthery-Goulart Maria Teresa, Aluísio Sandra Maria

Affiliations

Instituto de Ciências Matemáticas e de Computação - University of São Paulo, São Paulo, Brazil.

Center of Mathematics, Computing and Cognition, Federal University of ABC, São Paulo, Brazil.

Publication information

Lang Resour Eval. 2022;56(4):1333-1372. doi: 10.1007/s10579-022-09609-0. Epub 2022 Aug 17.

Abstract

This article presents RastrOS, a new eye-tracking corpus of eye movement data recorded from university students during silent reading of paragraphs of text in Brazilian Portuguese (BP). The article shows the potential of the corpus for natural language processing (NLP) by using it to evaluate the sentence complexity prediction task in BP, and it also describes the NLP resources and methods developed to create the corpus. Specifically, we present: (i) the method used to select the corpus paragraphs from large corpora, using linguistic metrics and clustering algorithms; (ii) the platform for collecting Cloze test responses, which is also responsible for creating the project datasets; and (iii) the hybrid semantic similarity method, based on word embedding models and contextualised word representations, used to generate semantic predictability norms. RastrOS can be downloaded from the Open Science Framework (OSF) repository together with the computational infrastructure mentioned above. Datasets with predictability norms from 393 participants and eye-tracking data from 37 participants are available in the OSF repository for this work (https://osf.io/9jxg3/).
