Kauchak David, Leroy Gondy, Hogue Alan
Computer Science Department, Pomona College, Claremont, CA.
Department of Management Information Systems, Eller College of Management, University of Arizona, Tucson, AZ.
J Assoc Inf Sci Technol. 2017 Sep;68(9):2088-2100. doi: 10.1002/asi.23855. Epub 2017 Jun 20.
Text simplification often relies on dated, unproven readability formulas. As an alternative, and motivated by the success of term familiarity, we test a complementary measure: grammar familiarity. Grammar familiarity is measured as the frequency of the 3-level sentence parse tree and is useful for evaluating individual sentences. We created a database of 140K unique 3-level parse structures by parsing and binning all 5.4M sentences in English Wikipedia. We then calculated the grammar frequencies across the corpus and created 11 frequency bins. We evaluate the measure with a user study and a corpus analysis. For the user study, we randomly selected 20 sentences from each bin, controlling for sentence length and term frequency, and recruited 30 readers per sentence (N = 6,600) on Amazon Mechanical Turk. We measured actual difficulty (comprehension) using a Cloze test, perceived difficulty using a 5-point Likert scale, and time taken. Sentences with more frequent grammatical structures, even with very different surface presentations, were easier to understand, were perceived as easier, and took less time to read. Outcomes from readability formulas correlated with perceived but not with actual difficulty. Our corpus analysis shows how the metric can be used to understand grammar regularity in a broad range of corpora.
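To make the measure concrete, the following is a minimal sketch of how a 3-level parse structure might be extracted and counted. It is not the authors' implementation. It assumes that parses are available as Penn-Treebank-style bracketed strings without a ROOT wrapper, that "3 level" means the root, its children, and its grandchildren (labels only, words dropped), and that bins are formed by splitting the frequency ranking into equal-sized groups. The abstract does not specify the binning method, and the function names `parse_bracketed`, `top_levels`, `grammar_frequencies`, and `assign_bins` are hypothetical.

```python
from collections import Counter


def parse_bracketed(text):
    """Parse a bracketed parse string into nested (label, children) tuples.

    Words are discarded; only constituent and part-of-speech labels are kept.
    """
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read_node():
        nonlocal pos
        pos += 1                      # consume "("
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read_node())
            else:
                pos += 1              # skip a word (leaf)
        pos += 1                      # consume ")"
        return (label, children)

    return read_node()


def top_levels(node, depth=3):
    """Return a string signature of the top `depth` levels of the tree."""
    label, children = node
    if depth == 1 or not children:
        return label
    parts = [label] + [top_levels(child, depth - 1) for child in children]
    return "(" + " ".join(parts) + ")"


def grammar_frequencies(parses, depth=3):
    """Count how often each truncated structure occurs in a list of parses."""
    return Counter(top_levels(parse_bracketed(p), depth) for p in parses)


def assign_bins(counts, n_bins=11):
    """Assumed binning: rank structures by frequency, split ranks evenly.

    Bin 0 holds the most frequent structures.
    """
    ranked = [structure for structure, _ in counts.most_common()]
    return {s: (i * n_bins) // len(ranked) for i, s in enumerate(ranked)}


# Two sentences with different words and different lower-level structure
# share the same 3-level signature; the third does not.
parses = [
    "(S (NP (DT the) (NN dog)) (VP (VBD chased) (NP (DT a) (NN cat))))",
    "(S (NP (DT a) (NN student)) (VP (VBD read) (NP (DT the) (JJ long) (NN book))))",
    "(S (VP (VB run)))",
]
counts = grammar_frequencies(parses)
for structure, count in counts.most_common():
    print(count, structure)
```

In this sketch the first two sentences collapse to the same signature, which illustrates how sentences with very different surface presentations can share a grammatical structure and therefore a familiarity score.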