Finch W Holmes
Department of Educational Psychology, Ball State University, Muncie, IN, USA.
Front Psychol. 2012 Jul 5;3:211. doi: 10.3389/fpsyg.2012.00211. eCollection 2012.
The presence of outliers can be very problematic in data analysis, leading statisticians to develop a wide variety of methods for identifying them in both the univariate and multivariate contexts. In the multivariate case, perhaps the most popular approach has been Mahalanobis distance, where large values indicate an observation that is unusual relative to the center of the data. However, researchers have identified problems with the application of this metric that may limit its utility in some situations. As a consequence, other methods for detecting outlying observations have been developed and studied. A number of these approaches, while apparently robust and useful, have not made their way into general practice in the social sciences. Thus, the goal of this study was to describe some of these methods and demonstrate them using a well-known dataset from a popular multivariate textbook widely used in the social sciences. Results showed that the methods yield datasets with very different distributional characteristics. These results are discussed in light of how they might be used by researchers and practitioners.
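To make the Mahalanobis distance mentioned above concrete, the following is a minimal sketch for the two-variable case, not the procedure used in the study. It assumes the standard definition, d² = (x − m)ᵀ S⁻¹ (x − m), with m the sample mean and S the sample covariance matrix. The function names and the 0.001 chi-square cutoff are illustrative assumptions; the abstract does not specify either.

```python
import math
from statistics import mean


def mahalanobis_sq_2d(points):
    """Squared Mahalanobis distance of each (x, y) pair from the sample mean."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    n = len(points)
    mx, my = mean(xs), mean(ys)
    # Entries of the sample covariance matrix S (n - 1 denominator).
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    det = sxx * syy - sxy ** 2
    if det <= 0:
        raise ValueError("covariance matrix is singular")
    distances = []
    for x, y in points:
        dx, dy = x - mx, y - my
        # d^2 = [dx dy] S^-1 [dx dy]^T, using the closed-form 2x2 inverse.
        distances.append((syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det)
    return distances


def flag_outliers(points, alpha=0.001):
    """Indices whose squared distance exceeds a chi-square (2 df) cutoff.

    The cutoff rule is an assumption for illustration. With 2 degrees of
    freedom the upper-alpha chi-square quantile is exactly -2 * ln(alpha).
    """
    cutoff = -2.0 * math.log(alpha)
    return [i for i, d2 in enumerate(mahalanobis_sq_2d(points)) if d2 > cutoff]


# Usage: a regular grid of points plus one extreme observation appended last.
data = [(float(x), float(y)) for x in range(5) for y in range(6)] + [(40.0, -40.0)]
print(flag_outliers(data))
```

Note that the mean and covariance in this sketch are computed from all observations, including any outliers, so extreme points influence the very quantities used to judge them.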