EMTeC: A corpus of eye movements on machine-generated texts.

Authors

Bolliger Lena S, Haller Patrick, Cretton Isabelle C R, Reich David R, Kew Tannon, Jäger Lena A

Affiliations

Department of Computational Linguistics, University of Zurich, Andreasstrasse 15, Zurich, 8050, Switzerland.

Department of Computer Science, University of Potsdam, An der Bahn 2, Potsdam, 14476, Germany.

Publication information

Behav Res Methods. 2025 Jun 3;57(7):189. doi: 10.3758/s13428-025-02677-4.

Abstract

The Eye movements on Machine-generated Texts Corpus (EMTeC) is a naturalistic eye-movements-while-reading corpus of 107 native English speakers reading machine-generated texts. The texts are generated by three large language models using five different decoding strategies, and they fall into six different text-type categories. EMTeC contains the eye movement data at all stages of pre-processing, i.e., the raw coordinate data sampled at 2000 Hz, the fixation sequences, and the reading measures. It further provides both the original and a corrected version of the fixation sequences, the latter accounting for vertical calibration drift. Moreover, the corpus includes the language models' internals that underlie the generation of the stimulus texts: the transition scores, the attention scores, and the hidden states. The stimuli are annotated for a range of linguistic features at both the text and the word level. We anticipate that EMTeC will be used for a variety of purposes, including but not restricted to the investigation of reading behavior on machine-generated text and the impact of different decoding strategies; reading behavior on different text types; the development of new pre-processing, data filtering, and drift correction algorithms; the cognitive interpretability and enhancement of language models; and the assessment of the predictive power of surprisal and entropy for human reading times. The data at all stages of pre-processing, the model internals, and the code to reproduce the stimulus generation, data pre-processing, and analyses can be accessed via https://github.com/DiLi-Lab/EMTeC/.
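The abstract names surprisal and entropy as predictors of human reading times without defining them. As a minimal sketch, assuming the standard information-theoretic definitions (surprisal of a word is -log2 of its probability in context, and entropy is the expected surprisal over the next-word distribution), the two quantities could be computed as follows. The function names and the toy distribution are hypothetical and are not taken from the EMTeC code base.

```python
import math


def surprisal(prob: float) -> float:
    """Surprisal, in bits, of an outcome that has probability `prob`."""
    if not 0.0 < prob <= 1.0:
        raise ValueError("prob must be in (0, 1]")
    return -math.log2(prob)


def entropy(dist: dict[str, float]) -> float:
    """Shannon entropy, in bits, of a next-word probability distribution."""
    # Zero-probability entries contribute nothing and are skipped.
    return -sum(p * math.log2(p) for p in dist.values() if p > 0.0)


# Hypothetical next-word distribution; the values are made up for illustration.
next_word = {"cat": 0.5, "dog": 0.25, "bird": 0.25}

print(surprisal(next_word["dog"]))  # 2.0
print(entropy(next_word))           # 1.5
```

In practice the probabilities would come from a language model's output distribution at each position rather than from a hand-written dictionary.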

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/8f54/12134054/9f0e191de4ca/13428_2025_2677_Fig1_HTML.jpg
