Etherington Thomas R
Manaaki Whenua - Landcare Research, Lincoln, New Zealand.
PeerJ. 2021 May 11;9:e11436. doi: 10.7717/peerj.11436. eCollection 2021.
The Mahalanobis distance is a statistical technique that has been used in statistics and data science for data classification and outlier detection, and in ecology to quantify species-environment relationships in habitat and ecological niche models. Mahalanobis distances are based on the location and scatter of a multivariate normal distribution, and can measure how distant any point in space is from the centre of this kind of distribution. Three different methods for calculating the multivariate location and scatter are commonly used: the sample mean and variance-covariance, the minimum covariance determinant, and the minimum volume ellipsoid. The minimum covariance determinant and minimum volume ellipsoid were developed to be robust to outliers by minimising the multivariate location and scatter for a subset of the full sample, with the proportion of the full sample forming the subset being controlled by a user-defined parameter. This outlier robustness means the minimum covariance determinant and the minimum volume ellipsoid are highly relevant for ecological niche analyses, which are usually based on natural history observations that are likely to contain errors. However, natural history observations will also contain extreme bias, to which the minimum covariance determinant and the minimum volume ellipsoid will also be sensitive. To provide guidance for selecting and parameterising a multivariate location and scatter method, a series of virtual ecological niche modelling experiments were conducted to demonstrate the performance of each multivariate location and scatter method under different levels of sample size, errors, and bias. The results show that there is no optimal modelling approach, and that choices need to be made based on the individual data and question. The sample mean and variance-covariance method will perform best on very small sample sizes if the data are free of error and bias. 
At larger sample sizes the minimum covariance determinant and minimum volume ellipsoid methods perform as well or better, but only if they are appropriately parameterised. Modellers who are more concerned about the prevalence of errors should retain a smaller proportion of the full data set, while modellers more concerned about the prevalence of bias should retain a larger proportion of the full data set. I conclude that Mahalanobis distances are a useful niche modelling technique, but only for questions relating to the fundamental niche of a species where the assumption of multivariate normality is reasonable. Users of the minimum covariance determinant and minimum volume ellipsoid methods must also clearly report their parameterisations so that the results can be interpreted correctly.
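The contrast between the sample mean and variance-covariance method and the minimum covariance determinant can be illustrated with a minimal sketch. It is not the code used in the study. The observations are invented two-variable data (seven plausible records plus one gross error), the function names are hypothetical, and the minimum covariance determinant is found by brute force over all subsets, which is only feasible for tiny samples. The `proportion` argument stands in for the user-defined parameter that controls how much of the full sample is retained. The consistency corrections that production implementations usually apply to the robust scatter are omitted.

```python
from itertools import combinations
from math import sqrt


def mean_and_cov(points):
    """Sample mean and variance-covariance (n - 1 denominator) of 2-D points."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    return (mx, my), (sxx, sxy, syy)


def determinant(cov):
    """Determinant of a symmetric 2x2 matrix stored as (sxx, sxy, syy)."""
    sxx, sxy, syy = cov
    return sxx * syy - sxy ** 2


def mahalanobis(point, location, cov):
    """Mahalanobis distance of one 2-D point from a location and scatter."""
    sxx, sxy, syy = cov
    dx, dy = point[0] - location[0], point[1] - location[1]
    # Quadratic form d' S^-1 d, using the closed-form inverse of a 2x2 matrix.
    q = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / determinant(cov)
    return sqrt(q)


def exact_mcd(points, proportion):
    """Brute-force MCD: mean and covariance of the h-point subset whose
    covariance determinant is smallest (no consistency correction)."""
    h = max(3, round(proportion * len(points)))
    best = min(combinations(points, h),
               key=lambda subset: determinant(mean_and_cov(subset)[1]))
    return mean_and_cov(best)


# Invented observations: (temperature, rainfall); the last record is an error.
observations = [
    (10.0, 100.0), (11.0, 104.0), (12.0, 98.0), (9.0, 95.0),
    (10.5, 101.0), (11.5, 103.0), (9.5, 97.0), (25.0, 20.0),
]
outlier = observations[-1]

classical_location, classical_cov = mean_and_cov(observations)
robust_location, robust_cov = exact_mcd(observations, proportion=0.75)

d_classical = mahalanobis(outlier, classical_location, classical_cov)
d_robust = mahalanobis(outlier, robust_location, robust_cov)

print("classical location:", classical_location, "distance:", d_classical)
print("robust location:   ", robust_location, "distance:", d_robust)
```

In this sketch the erroneous record pulls the classical location and inflates the classical scatter, so its own distance looks modest, whereas the subset-based estimate leaves it far from the centre. Lowering `proportion` discards more records, which guards against errors but, as the abstract cautions, makes the estimate more sensitive to biased sampling. For real analyses an established implementation, such as scikit-learn's `MinCovDet` with its `support_fraction` parameter, would be a more practical choice than this exhaustive search.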