Teresi Jeanne A, Ocepek-Welikson Katja, Kleinman Marjorie, Eimicke Joseph P, Crane Paul K, Jones Richard N, Lai Jin-Shei, Choi Seung W, Hays Ron D, Reeve Bryce B, Reise Steven P, Pilkonis Paul A, Cella David
Columbia University Stroud Center; Faculty of Medicine, New York State Psychiatric Institute.
Psychol Sci Q. 2009;51(2):148-180.
The aims of this paper are to present findings related to differential item functioning (DIF) in the Patient Reported Outcome Measurement Information System (PROMIS) depression item bank, and to discuss potential threats to the validity of results from studies of DIF. The 32 depression items studied were modified from several widely used instruments. DIF analyses of gender, age and education were performed using a sample of 735 individuals recruited by a survey polling firm. DIF hypotheses were generated by asking content experts to indicate whether or not they expected DIF to be present, and the direction of the DIF with respect to the studied comparison groups. Primary analyses were conducted using the graded item response model (for polytomous, ordered response category data) with likelihood ratio tests of DIF, accompanied by magnitude measures. Sensitivity analyses were performed using other item response models and approaches to DIF detection. Subject to some caveats, the items recommended for exclusion or for separate calibration were "I felt like crying" and "I had trouble enjoying things that I used to enjoy." The item "I felt I had no energy" was also flagged as evidencing DIF and recommended for additional review. On the one hand, false DIF detection (Type I error) was controlled to the extent possible by ensuring model fit and purification. On the other hand, power for DIF detection might have been compromised by several factors, including sparse data and small sample sizes. Nonetheless, practical and not just statistical significance should be considered. In this case the overall magnitude and impact of DIF were small for the groups studied, although impact was relatively large for some individuals.
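As an illustration of the methods named in the abstract, the following minimal sketch shows how category probabilities arise under a graded response model, how an expected item score difference between two groups could serve as one kind of DIF magnitude measure, and how a likelihood ratio statistic is formed from two fitted models. It is not the authors' analysis code. The function names, the item parameters, and the log-likelihood values are hypothetical and chosen only for illustration.

```python
import math

def grm_category_probs(theta, a, thresholds):
    """Graded response model: P(X = k | theta) for k = 0..len(thresholds).

    `a` is the discrimination; `thresholds` must be strictly increasing.
    """
    # Cumulative probabilities P(X >= k); by definition P(X >= 0) = 1
    # and P(X >= K + 1) = 0, where K is the number of thresholds.
    cum = [1.0]
    cum += [1.0 / (1.0 + math.exp(-a * (theta - b))) for b in thresholds]
    cum += [0.0]
    # Each category probability is the difference of adjacent cumulatives.
    return [cum[k] - cum[k + 1] for k in range(len(thresholds) + 1)]

def expected_item_score(theta, a, thresholds):
    """Expected response (0..K) at trait level theta."""
    probs = grm_category_probs(theta, a, thresholds)
    return sum(k * p for k, p in enumerate(probs))

def lr_statistic(loglik_constrained, loglik_free):
    """Likelihood ratio statistic: -2 * (constrained minus free log-likelihood)."""
    return -2.0 * (loglik_constrained - loglik_free)

# Hypothetical parameters for one five-category item (assumed values, not
# estimates from the paper). The focal group's thresholds are shifted down,
# so the item is easier to endorse at the same level of depression.
reference = {"a": 2.0, "thresholds": [0.0, 1.0, 2.0, 3.0]}
focal = {"a": 2.0, "thresholds": [-0.5, 0.5, 1.5, 2.5]}

theta = 1.0
score_gap = (expected_item_score(theta, **focal)
             - expected_item_score(theta, **reference))

# Hypothetical log-likelihoods: parameters constrained equal across groups
# versus freed for the studied item.
g2 = lr_statistic(loglik_constrained=-1010.0, loglik_free=-1005.0)
```

In this sketch, `score_gap` is the difference in expected item score at a single trait level, and `g2` would be compared against a chi-square distribution whose degrees of freedom equal the number of freed parameters.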