Department of Speech-Language-Hearing Sciences, Hofstra University, Hempstead, NY.
School of Languages and Social Sciences, Aston University, Birmingham, United Kingdom.
Lang Speech Hear Serv Sch. 2020 Apr 7;51(2):479-493. doi: 10.1044/2019_LSHSS-19-00056. Epub 2020 Mar 18.
Purpose: The results of automatic machine scoring of the Index of Productive Syntax using the Computerized Language ANalysis (CLAN) tools of the Child Language Data Exchange System of TalkBank (MacWhinney, 2000) were compared with manual scoring to determine the accuracy of the machine-scored method.

Method: Twenty transcripts of 10 children, drawn at 30 and 42 months from archival data of the Weismer Corpus of the Child Language Data Exchange System, were examined. Measures of absolute point difference and point-to-point accuracy were compared, as well as points erroneously given and missed. Two new measures for evaluating automatic scoring of the Index of Productive Syntax were introduced: Machine Item Accuracy (MIA) and Cascade Failure Rate; these measures further analyze points erroneously given and missed. Differences in total scores, subscale scores, and individual structures were also reported.

Results: The mean absolute point difference between machine and hand scoring was 3.65, point-to-point agreement was 72.6%, and MIA was 74.9%. There were large differences among subscales, with the Noun Phrase and Verb Phrase subscales generally showing greater accuracy and agreement than the Question/Negation and Sentence Structures subscales. Machine scoring produced significantly more erroneous than missed items, attributed to mistagging of elements, imprecise search patterns, and other errors. Cascade failure resulted in an average loss of 4.65 points per transcript.

Conclusions: The CLAN program showed relatively inaccurate outcomes compared with manual scoring on both traditional and new measures of accuracy. Recommendations for improving the program include accounting for second-exemplar violations and applying cascaded credit, among other suggestions. It is proposed that research on machine-scored syntax routinely report accuracy measures detailing erroneous and missed scores, including MIA, so that researchers and clinicians are aware of the limitations of a machine-scoring program.

Supplemental Material: https://doi.org/10.23641/asha.11984364
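The agreement measures reported above can be illustrated with a minimal sketch. This is not the study's scoring code: the item names and scores below are invented for illustration, and the exact operational definitions of MIA and Cascade Failure Rate are given in the full article, so only the two conventional measures (absolute point difference and point-to-point agreement) are computed here.

```python
# Hedged sketch: comparing machine-scored vs. hand-scored IPSyn items.
# Item labels and point values are illustrative only, not from the study.

def absolute_point_difference(machine, hand):
    """Absolute difference between the two methods' total scores."""
    return abs(sum(machine.values()) - sum(hand.values()))

def point_to_point_agreement(machine, hand):
    """Proportion of items on which both methods assign the same score."""
    items = machine.keys() & hand.keys()
    matches = sum(1 for item in items if machine[item] == hand[item])
    return matches / len(items)

# Hypothetical per-item scores (0, 1, or 2 points, as in IPSyn scoring).
hand    = {"N1": 2, "N2": 2, "V1": 2, "V2": 1, "Q1": 0, "S1": 2}
machine = {"N1": 2, "N2": 1, "V1": 2, "V2": 1, "Q1": 1, "S1": 1}

print(absolute_point_difference(machine, hand))  # 1 (totals: 8 vs. 9)
print(point_to_point_agreement(machine, hand))   # 0.5 (3 of 6 items match)
```

Note that the two measures can disagree: here the machine total is off by only one point, yet half the individual items are scored differently, because an erroneously given point (Q1) partly offsets missed points (N2, S1). This is why the article argues for reporting item-level measures such as MIA alongside total-score differences.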