A corpus of Chinese word segmentation agreement.

Authors

Tsang Yiu-Kei, Yan Ming, Pan Jinger, Chan Megan Yin Kan

Affiliations

Department of Education Studies, Hong Kong Baptist University, Kowloon Tong, Kowloon, Hong Kong.

Centre for Learning Sciences, Hong Kong Baptist University, Kowloon, Hong Kong.

Publication information

Behav Res Methods. 2024 Dec 28;57(1):25. doi: 10.3758/s13428-024-02528-8.

DOI: 10.3758/s13428-024-02528-8

PMID: 39733209

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11682008/

Abstract

The absence of explicit word boundaries is a distinctive characteristic of Chinese script that sets it apart from most alphabetic scripts and leads to word boundary disagreement among readers. Previous studies have examined how this feature may influence reading performance. However, further investigations are required to generate more ecologically valid and generalizable findings. To advance our understanding of the impact of word boundaries in Chinese reading, we introduce the Chinese Word Segmentation Agreement (CWSA) corpus. This corpus consists of 500 sentences, comprising 9813 character tokens and 1590 character types, and provides data on word segmentation agreement at each character position. The data revealed a high level of overall segmentation agreement (92%). However, participants disagreed on the position of word boundaries in 8.96% of the cases. Moreover, about 85% of the sentences contained at least one ambiguous word boundary. The character strings with high levels of disagreement were tentatively classified into three categories, namely the morphosyntactic type (e.g., "-"), modifier-head type (e.g., "-"), and others (e.g., "-"). Finally, the agreement scores significantly influenced reading behaviors, as evidenced by analyses with published eye movement data. Specifically, a high level of disagreement was associated with longer single fixation durations. We discuss the implications of these results and highlight how the CWSA corpus can facilitate future research on word segmentation in Chinese reading.
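The abstract does not spell out how agreement at each character position is scored, so the following is only a minimal sketch under stated assumptions: each participant's segmentation is turned into a boundary/no-boundary decision for every gap between adjacent characters, and agreement at a gap is taken to be the share of participants who made the majority decision. The function names and the toy segmentations are hypothetical and are not taken from the CWSA corpus.

```python
from typing import List


def boundary_flags(words: List[str]) -> List[int]:
    """Return one flag per gap between adjacent characters.

    1 means a word boundary falls in that gap, 0 means it does not.
    """
    flags: List[int] = []
    for i, word in enumerate(words):
        # Gaps inside a word never carry a boundary.
        flags.extend([0] * (len(word) - 1))
        # The gap after every word except the last one is a boundary.
        if i < len(words) - 1:
            flags.append(1)
    return flags


def agreement_by_gap(segmentations: List[List[str]]) -> List[float]:
    """Assumed scoring: share of raters who made the majority decision per gap."""
    if not segmentations:
        raise ValueError("need at least one segmentation")
    if len({"".join(words) for words in segmentations}) != 1:
        raise ValueError("all segmentations must cover the same sentence")
    all_flags = [boundary_flags(words) for words in segmentations]
    n_raters = len(all_flags)
    scores: List[float] = []
    for gap_decisions in zip(*all_flags):
        p_boundary = sum(gap_decisions) / n_raters
        scores.append(max(p_boundary, 1 - p_boundary))
    return scores


# Invented toy data: four hypothetical raters segmenting one six-character string.
toy_segmentations = [
    ["研究", "生命", "起源"],
    ["研究", "生命", "起源"],
    ["研究", "生命起源"],
    ["研究生", "命", "起源"],
]

print(agreement_by_gap(toy_segmentations))
```

Under this assumed scoring, a value of 1.0 means every rater made the same decision at that gap, while values near 0.5 mark the kind of ambiguous boundary the abstract describes.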

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4665/11682008/104364c3dde7/13428_2024_2528_Fig1_HTML.jpg
