Chen Hui-Fang, Jin Kuan-Yu
Social and Behavioral Sciences, City University of Hong Kong, Kowloon, Hong Kong.
Faculty of Education, The University of Hong Kong, Pokfulam, Hong Kong.
Front Psychol. 2018 Jul 27;9:1302. doi: 10.3389/fpsyg.2018.01302. eCollection 2018.
Conventional differential item functioning (DIF) approaches such as logistic regression (LR) often assume unidimensionality of a scale and match participants in the reference and focal groups on total scores. However, many educational and psychological assessments are multidimensional by design, and a matching variable based on total scores that does not reflect the test structure may be inappropriate for DIF detection in multidimensional items. We propose using all subscores of a scale as matching variables in LR and compare this approach with alternative matching methods, including the total score and individual subscores. We focused on the uniform DIF situation, in which 250, 500, or 1,000 participants per group answered 21 items reflecting two dimensions, with the 21st item as the studied item. Five factors were manipulated: (a) the test structure, (b) the number of cross-loaded items, (c) group differences in latent abilities, (d) the magnitude of DIF, and (e) group sample size. The results showed that, when the studied item measured a single domain, conventional LR with the total score as the matching variable yielded inflated false positive rates (FPRs) when the two groups differed in one latent ability; the situation worsened when one group had higher ability in one domain and lower ability in the other. LR using a single subscore as the matching variable performed well in terms of FPRs and true positive rates (TPRs) when the two groups differed in at most one latent ability, but it yielded inflated FPRs when the groups differed in both latent abilities. The proposed LR using two subscores yielded well-controlled FPRs across all conditions and the highest TPRs.
When the studied item measured two domains, matching on either the total score or the two subscores controlled FPRs well and yielded similar TPRs across conditions, whereas matching on a single subscore inflated FPRs when the two groups differed in one or both latent abilities. In conclusion, we recommend matching subjects on multiple subscores for DIF detection with multidimensional data.
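The LR-based uniform DIF test described above can be sketched in a short simulation. The code below is a minimal illustration under assumed values, not the authors' actual design: it simulates two-dimensional Rasch-type responses where the focal group is higher on one latent ability and lower on the other, injects uniform DIF of 0.6 logits into the studied item, and then tests the group coefficient in a logistic regression that matches on both subscores. All sample sizes, difficulties, and effect sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500  # participants per group (illustrative)

# Two latent abilities per person; the focal group is higher on dim 1
# and lower on dim 2 (the hardest condition for total-score matching).
theta = np.vstack([
    rng.normal(0.0, 1.0, size=(n, 2)),          # reference group
    rng.normal([0.5, -0.5], 1.0, size=(n, 2)),  # focal group
])
group = np.repeat([0.0, 1.0], n)

# 20 DIF-free items with simple structure: items 1-10 load on dim 1,
# items 11-20 on dim 2 (Rasch-type model with illustrative difficulties).
b = rng.normal(0.0, 1.0, 20)
ability = np.where(np.arange(20) < 10, theta[:, [0]], theta[:, [1]])
resp = rng.binomial(1, 1.0 / (1.0 + np.exp(-(ability - b))))
sub1, sub2 = resp[:, :10].sum(1), resp[:, 10:].sum(1)  # subscores

# Studied (21st) item: loads on dim 1 and carries uniform DIF of
# 0.6 logits favouring the focal group.
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(theta[:, 0] + 0.6 * group))))

def fit_logistic(X, y, iters=25):
    """Plain Newton-Raphson logistic fit; returns coefficients and SEs."""
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        H = X.T @ (X * (p * (1.0 - p))[:, None])  # information matrix
        beta = beta + np.linalg.solve(H, X.T @ (y - p))
    return beta, np.sqrt(np.diag(np.linalg.inv(H)))

# Uniform DIF test: group coefficient after matching on BOTH subscores.
X = np.column_stack([np.ones(2 * n), sub1, sub2, group])
beta, se = fit_logistic(X, y)
z = beta[3] / se[3]  # Wald statistic for the group effect
print(f"group coefficient = {beta[3]:.3f}, Wald z = {z:.2f}")
```

Matching on the total score instead would replace the two subscore columns with a single `sub1 + sub2` column; under the crossed ability differences simulated here, that is the condition in which the abstract reports inflated false positive rates.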