RastrOS Project: Natural Language Processing contributions to the development of an eye-tracking corpus with predictability norms for Brazilian Portuguese.

Authors

Leal Sidney Evaldo, Lukasova Katerina, Carthery-Goulart Maria Teresa, Aluísio Sandra Maria

Affiliations

Instituto de Ciências Matemáticas e de Computação - University of São Paulo, São Paulo, Brazil.

Center of Mathematics, Computing and Cognition, Federal University of ABC, São Paulo, Brazil.

Publication information

Lang Resour Eval. 2022;56(4):1333-1372. doi: 10.1007/s10579-022-09609-0. Epub 2022 Aug 17.

DOI:10.1007/s10579-022-09609-0

PMID:35990365

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC9383681/

Abstract

This article presents RastrOS, a new eye-tracking corpus of eye movement data from university students during silent reading of paragraphs of texts in Brazilian Portuguese (BP). The article shows the potential of the corpus for natural language processing (NLP) by using it to evaluate the sentence complexity prediction task in BP, and it also describes the NLP resources and methods developed to create the corpus. Specifically, we present: (i) the method used to select the corpus paragraphs from large corpora, using linguistic metrics and clustering algorithms; (ii) the platform for collecting the Cloze test, which is also responsible for creating the project datasets; and (iii) the hybrid semantic similarity method, based on word embedding models and contextualised word representations, used to generate semantic predictability norms. RastrOS can be downloaded from the Open Science Framework (OSF) repository with the computational infrastructure mentioned above. Datasets with predictability norms of 393 participants and eye-tracking data of 37 participants are available in the OSF repository for this work (https://osf.io/9jxg3/).

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/044e/9383681/d8925bfaf6bf/10579_2022_9609_Fig1_HTML.jpg

Similar articles

The Provo Corpus: A large eye-tracking corpus with predictability norms.

Behav Res Methods. 2018 Apr;50(2):826-833. doi: 10.3758/s13428-017-0908-4.

The Beijing Sentence Corpus: A Chinese sentence corpus with eye movement data and predictability norms.

Behav Res Methods. 2022 Aug;54(4):1989-2000. doi: 10.3758/s13428-021-01730-2. Epub 2021 Nov 23.

Sentence-final completion norms for 2925 Mexican Spanish sentence contexts.

Behav Res Methods. 2024 Mar;56(3):2486-2498. doi: 10.3758/s13428-023-02160-y. Epub 2023 Jul 5.

Russian Sentence Corpus: Benchmark measures of eye movements in reading in Russian.

Behav Res Methods. 2019 Jun;51(3):1161-1178. doi: 10.3758/s13428-018-1051-6.

Morphosyntactic but not lexical corpus-based probabilities can substitute for cloze probabilities in reading experiments.

PLoS One. 2021 Jan 28;16(1):e0246133. doi: 10.1371/journal.pone.0246133. eCollection 2021.

Lexical Predictability During Natural Reading: Effects of Surprisal and Entropy Reduction.

Cogn Sci. 2018 Jun;42 Suppl 4(Suppl 4):1166-1183. doi: 10.1111/cogs.12597. Epub 2018 Feb 14.

Language models outperform cloze predictability in a cognitive model of reading.

PLoS Comput Biol. 2024 Sep 25;20(9):e1012117. doi: 10.1371/journal.pcbi.1012117. eCollection 2024 Sep.

Linguistic networks associated with lexical, semantic and syntactic predictability in reading: A fixation-related fMRI study.

Neuroimage. 2019 Apr 1;189:224-240. doi: 10.1016/j.neuroimage.2019.01.018. Epub 2019 Jan 14.

Human and computer estimations of Predictability of words in written language.

Sci Rep. 2020 Mar 10;10(1):4396. doi: 10.1038/s41598-020-61353-z.

Cited by

PoTeC: A German naturalistic eye-tracking-while-reading corpus.

Behav Res Methods. 2025 Jun 30;57(8):211. doi: 10.3758/s13428-024-02536-8.

EMTeC: A corpus of eye movements on machine-generated texts.

Behav Res Methods. 2025 Jun 3;57(7):189. doi: 10.3758/s13428-025-02677-4.

When function words carry content.

Q J Exp Psychol (Hove). 2024 Dec 26;78(10):17470218241307582. doi: 10.1177/17470218241307582.
