Tsang Yiu-Kei, Yan Ming, Pan Jinger, Chan Megan Yin Kan
Department of Education Studies, Hong Kong Baptist University, Kowloon Tong, Kowloon, Hong Kong.
Centre for Learning Sciences, Hong Kong Baptist University, Kowloon, Hong Kong.
Behav Res Methods. 2024 Dec 28;57(1):25. doi: 10.3758/s13428-024-02528-8.
The absence of explicit word boundaries is a distinctive characteristic of Chinese script that sets it apart from most alphabetic scripts and leads readers to disagree about where word boundaries fall. Previous studies have examined how this feature may influence reading performance, but further investigation is needed to produce more ecologically valid and generalizable findings. To advance our understanding of the impact of word boundaries in Chinese reading, we introduce the Chinese Word Segmentation Agreement (CWSA) corpus. The corpus consists of 500 sentences, comprising 9813 character tokens and 1590 character types, and provides word segmentation agreement data at each character position. The data revealed a high level of overall segmentation agreement (92%). However, participants disagreed on the position of word boundaries in 8.96% of the cases, and about 85% of the sentences contained at least one ambiguous word boundary. The character strings with high levels of disagreement were tentatively classified into three categories: the morphosyntactic type (e.g., "-"), the modifier-head type (e.g., "-"), and others (e.g., "-"). Finally, analyses of published eye movement data showed that the agreement scores significantly influenced reading behavior: a high level of disagreement was associated with longer single fixation durations. We discuss the implications of these results and highlight how the CWSA corpus can facilitate future research on word segmentation in Chinese reading.
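As a rough illustration of what a segmentation agreement score at each character position could look like, the Python sketch below computes one score per gap between adjacent characters. It rests on assumptions that are not taken from the abstract: each participant's segmentation is given as a string with "/" marking word boundaries, and agreement at a gap is defined as the share of participants choosing the majority option (boundary or no boundary). The function names and the placeholder letters are hypothetical, and the corpus itself may define and store agreement differently.

```python
def boundary_positions(segmented: str, sep: str = "/") -> set[int]:
    """Return the gaps that carry a boundary mark.

    A gap is identified by the number of characters that precede it,
    so gap 2 lies between the second and third character.
    """
    positions = set()
    n_seen = 0
    for ch in segmented:
        if ch == sep:
            positions.add(n_seen)
        else:
            n_seen += 1
    return positions


def agreement_scores(segmentations: list[str], sep: str = "/") -> list[float]:
    """Majority-share agreement for every internal gap of one sentence.

    Assumed definition (not the authors'): at each gap, the score is the
    proportion of participants who chose the more common option.
    """
    texts = {s.replace(sep, "") for s in segmentations}
    if len(texts) != 1:
        raise ValueError("all segmentations must cover the same sentence")
    n_chars = len(texts.pop())
    marks = [boundary_positions(s, sep) for s in segmentations]

    scores = []
    for gap in range(1, n_chars):  # internal gaps only
        share_boundary = sum(gap in m for m in marks) / len(marks)
        scores.append(max(share_boundary, 1 - share_boundary))
    return scores


# Four hypothetical participants segmenting a four-character placeholder string.
responses = ["AB/CD", "AB/CD", "AB/C/D", "A/B/CD"]
print(agreement_scores(responses))
```

Under this definition a score of 1.0 means every participant treated the gap the same way, and values near 0.5 mark the most ambiguous boundaries.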