Elder Catherine, McNamara Tim, Congdon Peter
Department of Applied Language Studies and Linguistics, University of Auckland, Private Bag 92019, Auckland, New Zealand.
J Appl Meas. 2003;4(2):181-97.
The use of common tasks and rating procedures when assessing the communicative skills of students from highly diverse linguistic and cultural backgrounds poses particular measurement challenges, which have thus far received little research attention. If assessment tasks or criteria are found to function differentially for particular subpopulations within a test candidature with the same or a similar level of criterion ability, then the test is open to charges of bias in favour of one or the other group. While there have been numerous studies involving dichotomous language test items (see, e.g., Chen and Henning, 1985, and more recently Elder, 1996), few studies have considered the issue of bias in relation to performance-based tasks which are assessed subjectively, via analytic and holistic rating scales. The paper demonstrates how Rasch analytic procedures can be applied to the investigation of item bias or differential item functioning (DIF) in both dichotomous and scalar items on a test of English for academic purposes. The data were gathered from a pilot English language test administered to a representative sample of undergraduate students (N = 139) enrolled in their first year of study at an English-medium university. The sample included native speakers of English who had completed up to 12 years of secondary schooling in their first language (L1) and immigrant students, mainly from Asian language backgrounds, with varying degrees of prior English language instruction and exposure. The purpose of the test was to diagnose the academic English needs of incoming undergraduates so that additional support could be offered to those deemed at risk of failure in their university study. Some of the tasks included in the assessment procedure involved objectively scored items (measuring vocabulary knowledge, text-editing skills, and reading and listening comprehension), whereas others (i.e., a report and an argumentative writing task) were subjectively scored. The study models a methodology for estimating bias in both dichotomous and scalar items, using the program Quest (Adams and Khoo, 1993) for the former and ConQuest (Wu, Adams and Wilson, 1998) for the latter. It also offers answers to the practical questions of whether a common set of assessment criteria can, in an academic context such as this one, be meaningfully applied to all subgroups within the candidature, and whether analytic criteria are more susceptible to biased ratings than holistic ones. Implications for test fairness and test validity are discussed.
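To make the dichotomous-item side of the methodology concrete, the sketch below illustrates the general logic of Rasch-based DIF detection on simulated data: fit a Rasch model jointly, then re-estimate each item's difficulty separately for two subgroups with person abilities held fixed, and flag items whose standardized difficulty contrast is large. This is a minimal illustration of the technique in general, not the authors' actual Quest or ConQuest analysis; the data, group labels, sample sizes, DIF magnitude, and helper names (sigmoid, group_difficulty) are all invented for the example, and joint maximum-likelihood estimation is used purely for brevity.

import numpy as np

rng = np.random.default_rng(0)

# --- Simulate responses: 300 persons, 10 dichotomous items, two groups.
n_persons, n_items = 300, 10
group = rng.integers(0, 2, n_persons)          # hypothetical group membership
theta = rng.normal(0, 1, n_persons)            # true person abilities
b_true = np.linspace(-1.5, 1.5, n_items)       # true item difficulties
dif = np.zeros(n_items)
dif[4] = 0.8                                   # item 5 is harder for group 1
logits = theta[:, None] - (b_true + np.outer(group, dif))
X = (rng.random((n_persons, n_items)) < 1 / (1 + np.exp(-logits))).astype(float)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# --- Joint maximum-likelihood Rasch fit via alternating Newton steps.
th = np.zeros(n_persons)
b = np.zeros(n_items)
for _ in range(50):
    p = sigmoid(th[:, None] - b)
    step = (X - p).sum(1) / np.clip((p * (1 - p)).sum(1), 1e-9, None)
    th = np.clip(th + step, -6, 6)             # bound extreme-score persons
    p = sigmoid(th[:, None] - b)
    b -= (X - p).sum(0) / np.clip((p * (1 - p)).sum(0), 1e-9, None)
    b -= b.mean()                              # identify the logit scale

# --- DIF: re-estimate each item's difficulty within each group,
#     holding the jointly estimated abilities fixed (anchored).
def group_difficulty(mask):
    bg = b.copy()
    for _ in range(50):
        p = sigmoid(th[mask, None] - bg)
        info = np.clip((p * (1 - p)).sum(0), 1e-9, None)
        bg -= (X[mask] - p).sum(0) / info
    return bg, 1 / np.sqrt(info)               # estimates and standard errors

b0, se0 = group_difficulty(group == 0)
b1, se1 = group_difficulty(group == 1)
z = (b1 - b0) / np.sqrt(se0**2 + se1**2)       # standardized DIF contrast
for i, zi in enumerate(z):
    flag = "  <-- possible DIF" if abs(zi) > 2 else ""
    print(f"item {i + 1:2d}: z = {zi:+.2f}{flag}")

Running the sketch flags the simulated DIF item while leaving the others near zero. A published analysis would instead use conditional or marginal estimation as implemented in Quest, extend the model with rating-scale or partial-credit structures (as ConQuest does) for the subjectively scored tasks, and apply formal significance criteria rather than the rough |z| > 2 rule used here.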