Leung Chung Hong Danny, Chow Mei Yung Vanliza, Ge Haoyan
School of Education and Languages, Hong Kong Metropolitan University, Ho Man Tin, Kowloon, Hong Kong Special Administrative Region.
Data Brief. 2022 Aug 8;44:108527. doi: 10.1016/j.dib.2022.108527. eCollection 2022 Oct.
This data article describes the development of a learner corpus (i.e., a systematic, computerized, web-based repository of written texts produced by language learners), from the initial phase, in which written assignments were collected from language learners as raw data, to the critical phases, in which the processed text data and metadata were aligned and transformed for the corpus's web interface. The resulting corpus, the CELL (Chinese and English Learner Language) Corpus, comprises: i) text data containing 4.2 million English words and 18 million Chinese characters; and ii) metadata, including demographic information on the participants whose texts were collected. The article first outlines the steps for collecting the text data and metadata and then explains the processes for cleaning, annotating and tagging the text data. It also discusses the problems the research team encountered in segmenting the Chinese text data and in checking the accuracy of the processed datasets. The CELL Corpus provides concordance and word list features that enable language teachers and researchers to investigate the frequency, accuracy and complexity of vocabulary use in learner language. The steps and processes reported here can inform the future development of learner corpora in other languages.
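To make the two corpus features named above concrete, the following is a minimal Python sketch of a word list (frequency count) and a concordance (keyword in context) over a handful of English texts. It is not the CELL Corpus implementation: the function names, the regular-expression tokenizer and the sample sentences are all assumptions made for illustration. Chinese text would first need a word segmenter, which is the step the article reports as problematic, and is not covered here.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase English text and split it into word tokens (a deliberately simple rule)."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def word_list(texts):
    """Return (word, frequency) pairs, most frequent first, ties broken alphabetically."""
    counts = Counter(tok for text in texts for tok in tokenize(text))
    return sorted(counts.items(), key=lambda pair: (-pair[1], pair[0]))


def concordance(texts, keyword, width=3):
    """Return (left, keyword, right) tuples with up to `width` tokens of context per side."""
    keyword = keyword.lower()
    lines = []
    for text in texts:
        tokens = tokenize(text)
        for i, tok in enumerate(tokens):
            if tok == keyword:
                left = " ".join(tokens[max(0, i - width):i])
                right = " ".join(tokens[i + 1:i + 1 + width])
                lines.append((left, tok, right))
    return lines


# Hypothetical learner sentences, invented for this example only.
sample_texts = [
    "The students write essays every week.",
    "Many students find academic essays difficult to write.",
]

if __name__ == "__main__":
    for word, freq in word_list(sample_texts)[:5]:
        print(f"{word}\t{freq}")
    for left, hit, right in concordance(sample_texts, "students", width=2):
        print(f"{left:>20} [{hit}] {right}")
```

In a real learner corpus, each text would also carry the participant metadata described above, so that word lists and concordance lines could be filtered by learner group.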