Rouse Steven V
Social Sciences Division, Pepperdine University, Malibu, CA 90263, USA.
J Pers Assess. 2007 Jun;88(3):264-75. doi: 10.1080/00223890701293908.
Reliability generalization (RG) is a meta-analytic technique that allows for the systematic examination of variation in score reliability for different samples of test takers; this procedure is based on the recognition that reliability is not a stable property of a test but is sample dependent. As a demonstration of an RG analysis, I obtained 63 reliability coefficients for each of the MMPI-2 (Butcher et al., 2001) Personality Psychopathology 5 (Harkness, McNulty, & Ben-Porath, 1995) scales. The overall variability of alpha coefficients supports the argument that reliability is sample dependent and underscores the need for researchers to calculate reliability estimates based on their research samples rather than simply citing published alpha coefficients as evidence of score reliability. I observed statistically significant mean reliability differences for scores across the 5 scales, with the highest level of reliability observed for scores on the measure of Negative Emotionality and the lowest levels of reliability observed for scores on the measures of Aggression and Disconstraint. There was no evidence that the sex-composition of a sample was systematically related to score reliability, and there were no statistically significant differences in reliability between scores obtained with the English version of the test and those obtained with translated forms. However, reliability was consistently lower for scores on some scales when the data were obtained in nonclinical settings as opposed to clinical ones. Sample size was not significantly correlated with reliability estimates. RG methods have the potential for deepening the level of understanding about the role of reliability in the evaluation and use of personality tests.
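The abstract's central claim, that an alpha coefficient describes a set of scores rather than the test itself, can be illustrated with a minimal sketch. The code below is not the procedure used in the article. It assumes the standard Cronbach's alpha formula, and the two small response matrices (`sample_a`, `sample_b`) and the helper names (`cronbach_alpha`, `summarize_alphas`) are hypothetical, invented only for illustration. Published RG studies often also transform or weight the coefficients before summarizing them, which this sketch does not attempt.

```python
from statistics import mean, stdev, variance


def cronbach_alpha(scores):
    """Cronbach's alpha for one sample.

    scores: list of respondents, each a list of item scores (same length).
    """
    k = len(scores[0])
    if k < 2 or len(scores) < 2:
        raise ValueError("need at least 2 items and 2 respondents")
    # Sample variance of each item (column) across respondents.
    item_variances = [variance(column) for column in zip(*scores)]
    # Sample variance of each respondent's total score.
    total_variance = variance([sum(row) for row in scores])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)


def summarize_alphas(alphas):
    """Unweighted descriptive summary of coefficients across samples."""
    return {
        "mean": mean(alphas),
        "sd": stdev(alphas),
        "min": min(alphas),
        "max": max(alphas),
    }


# Hypothetical data: the same three items answered by two different samples.
# sample_a spans the full response range; sample_b has a restricted range.
sample_a = [[1, 1, 1], [2, 2, 3], [3, 3, 3], [4, 5, 4], [5, 5, 5]]
sample_b = [[3, 3, 4], [3, 4, 3], [4, 3, 3], [4, 4, 4], [3, 3, 3]]

alpha_a = cronbach_alpha(sample_a)
alpha_b = cronbach_alpha(sample_b)
summary = summarize_alphas([alpha_a, alpha_b])

print(f"alpha (sample A): {alpha_a:.3f}")
print(f"alpha (sample B): {alpha_b:.3f}")
print(summary)
```

With these made-up responses, the same three items yield an alpha of roughly 0.98 in the heterogeneous sample and 0.375 in the range-restricted one, which is the kind of between-sample variation an RG analysis is designed to describe.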