Assessing DIF among small samples with separate calibration t and Mantel-Haenszel χ² statistics in the Rasch model.

Author information

Bernstein Ira, Samuels Ellery, Woo Ada, Hagge Sarah L

Affiliation

National Council of State Boards of Nursing (NCSBN), 111 E. Wacker Drive, Ste. 2900, Chicago, IL 60601-4277, USA

Publication information

J Appl Meas. 2013;14(4):389-99.

PMID: 24064579

Abstract

The National Council Licensure Examination (NCLEX) program has evaluated differential item functioning (DIF) using the Mantel-Haenszel (M-H) chi-square statistic. Since a Rasch model is assumed, DIF implies a difference in item difficulty between a reference group, e.g., White applicants, and a focal group, e.g., African-American applicants. The National Council of State Boards of Nursing (NCSBN) is planning to change the statistic used to evaluate DIF on the NCLEX from M-H to the separate calibration t-test (t). In actuality, M-H and t should yield identical results in large samples if the assumptions of the Rasch model hold (Linacre and Wright, 1989, also see Smith, 1996). However, as is true throughout statistics, "how large is large" is undefined, so it is quite possible that systematic differences exist in relatively smaller samples. This paper compares M-H and t in four sets of computer simulations. Three simulations used a ten-item test with nine fair items and one potentially containing DIF. To address instability that may result from a ten-item test, the fourth used a 30-item test with 29 fair items and one potentially containing DIF. Depending upon the simulation, the magnitude of population DIF (0, .5, 1.0, and 1.5 z-score units), the ability difference between the focal and reference group (-1, 0, and 1 z-score units), the focal group size (0, 10, 20, 40, 50, 80, 160, and 1000), and the reference group size (500 and 1000) were varied. The results were that: (a) differences in estimated DIF between the M-H and t statistics are generally small, (b) t tends to estimate lower chance probabilities than M-H with small sample sizes, (c) neither method is likely to detect DIF, especially when it is of slight magnitude in small focal group sizes, and (d) M-H does marginally better than t at detecting DIF but this improvement is also limited to very small focal group sizes.
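The two statistics compared in the abstract can be illustrated with a minimal Python sketch. It is not the authors' simulation code. It assumes that abilities are drawn from a unit-variance normal distribution, that DIF is a shift in the studied item's difficulty on the Rasch scale, and that M-H strata are total raw scores including the studied item. The function names `simulate_rasch`, `mantel_haenszel`, and `separate_calibration_t` are hypothetical. The t function takes item difficulties and standard errors that have already been estimated separately in each group; the Rasch calibration itself is omitted.

```python
import math
import random


def simulate_rasch(n, ability_mean, difficulties, rng):
    """Simulate n examinees under a Rasch model; return 0/1 response rows."""
    rows = []
    for _ in range(n):
        theta = rng.gauss(ability_mean, 1.0)  # assumed N(mean, 1) abilities
        rows.append([
            1 if rng.random() < 1.0 / (1.0 + math.exp(b - theta)) else 0
            for b in difficulties
        ])
    return rows


def mantel_haenszel(ref, foc, item):
    """Return (M-H chi-square with continuity correction, common odds ratio)
    for one item, stratifying examinees on total raw score."""
    n_items = len(ref[0])
    sum_a = sum_e = sum_v = 0.0
    num = den = 0.0
    for score in range(n_items + 1):
        r = [row[item] for row in ref if sum(row) == score]
        f = [row[item] for row in foc if sum(row) == score]
        n_r, n_f = len(r), len(f)
        t = n_r + n_f
        if n_r == 0 or n_f == 0:
            continue  # a stratum needs members of both groups
        a, c = sum(r), sum(f)      # correct: reference, focal
        b, d = n_r - a, n_f - c    # incorrect: reference, focal
        m1, m0 = a + c, b + d      # column totals
        sum_a += a
        sum_e += n_r * m1 / t                              # E[a] under no DIF
        sum_v += n_r * n_f * m1 * m0 / (t * t * (t - 1))   # Var[a] under no DIF
        num += a * d / t
        den += b * c / t
    if sum_v > 0:
        chi2 = max(0.0, abs(sum_a - sum_e) - 0.5) ** 2 / sum_v
    else:
        chi2 = 0.0
    odds = num / den if den > 0 else float("inf")
    return chi2, odds


def separate_calibration_t(d_ref, se_ref, d_foc, se_foc):
    """t for the difference between two separately estimated difficulties."""
    return (d_ref - d_foc) / math.sqrt(se_ref ** 2 + se_foc ** 2)


if __name__ == "__main__":
    rng = random.Random(0)
    fair = [0.0] * 9
    # Ten-item test: nine fair items plus one item that is harder for the focal group.
    ref = simulate_rasch(1000, 0.0, fair + [0.0], rng)
    foc = simulate_rasch(40, 0.0, fair + [1.0], rng)
    chi2, odds = mantel_haenszel(ref, foc, item=9)
    print(f"M-H chi-square = {chi2:.2f}, common odds ratio = {odds:.2f}")
```

Rerunning the sketch with different focal group sizes gives a rough feel for how the M-H statistic behaves when few focal examinees fall into each score stratum, which is the situation the simulations in the paper examine.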
