Mahalanobis distances for ecological niche modelling and outlier detection: implications of sample size, error, and bias for selecting and parameterising a multivariate location and scatter method.

Author information

Etherington Thomas R

Affiliation

Manaaki Whenua - Landcare Research, Lincoln, New Zealand.

Publication information

PeerJ. 2021 May 11;9:e11436. doi: 10.7717/peerj.11436. eCollection 2021.

DOI: 10.7717/peerj.11436

PMID: 34026369

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC8121071/

Abstract

The Mahalanobis distance is a statistical technique that has been used in statistics and data science for data classification and outlier detection, and in ecology to quantify species-environment relationships in habitat and ecological niche models. Mahalanobis distances are based on the location and scatter of a multivariate normal distribution, and can measure how distant any point in space is from the centre of this kind of distribution. Three different methods for calculating the multivariate location and scatter are commonly used: the sample mean and variance-covariance, the minimum covariance determinant, and the minimum volume ellipsoid. The minimum covariance determinant and minimum volume ellipsoid were developed to be robust to outliers by minimising the multivariate location and scatter for a subset of the full sample, with the proportion of the full sample forming the subset being controlled by a user-defined parameter. This outlier robustness means the minimum covariance determinant and the minimum volume ellipsoid are highly relevant for ecological niche analyses, which are usually based on natural history observations that are likely to contain errors. However, natural history observations will also contain extreme bias, to which the minimum covariance determinant and the minimum volume ellipsoid will also be sensitive. To provide guidance for selecting and parameterising a multivariate location and scatter method, a series of virtual ecological niche modelling experiments were conducted to demonstrate the performance of each multivariate location and scatter method under different levels of sample size, errors, and bias. The results show that there is no optimal modelling approach, and that choices need to be made based on the individual data and question. The sample mean and variance-covariance method will perform best on very small sample sizes if the data are free of error and bias. At larger sample sizes the minimum covariance determinant and minimum volume ellipsoid methods perform as well or better, but only if they are appropriately parameterised. Modellers who are more concerned about the prevalence of errors should retain a smaller proportion of the full data set, while modellers more concerned about the prevalence of bias should retain a larger proportion of the full data set. I conclude that Mahalanobis distances are a useful niche modelling technique, but only for questions relating to the fundamental niche of a species where the assumption of multivariate normality is reasonable. Users of the minimum covariance determinant and minimum volume ellipsoid methods must also clearly report their parameterisations so that the results can be interpreted correctly.
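As a rough illustration of the ideas in the abstract, the sketch below computes squared Mahalanobis distances, D² = (x − μ)ᵀ Σ⁻¹ (x − μ), from two different location and scatter estimates: the sample mean and variance-covariance, and a simplified subset-based estimate in the spirit of the minimum covariance determinant. It is not the author's code and not a full FAST-MCD implementation. The function names, the `support_fraction` parameter name, the synthetic two-variable "environment" data, and the number of random starts are all assumptions made for this example, and the subset scatter is left without any consistency correction.

```python
import numpy as np


def mahalanobis_sq(points, location, scatter):
    """Squared Mahalanobis distance of each row of `points` from `location`."""
    diff = np.atleast_2d(points) - location
    # Solve scatter @ x = diff.T instead of forming the inverse explicitly.
    solved = np.linalg.solve(scatter, diff.T).T
    return np.einsum("ij,ij->i", diff, solved)


def classical_estimate(sample):
    """Sample mean and variance-covariance matrix of the full sample."""
    return sample.mean(axis=0), np.cov(sample, rowvar=False)


def concentrated_estimate(sample, support_fraction=0.75, n_starts=20,
                          n_steps=30, seed=0):
    """Simplified MCD-style estimate (illustrative assumption, not FAST-MCD).

    `support_fraction` is the proportion of the full sample retained in the
    subset whose covariance determinant is minimised.
    """
    rng = np.random.default_rng(seed)
    n, p = sample.shape
    h = max(int(np.ceil(support_fraction * n)), p + 1)
    best = None
    for _ in range(n_starts):
        # Start from a small random subset, then repeat concentration steps.
        subset = rng.choice(n, size=p + 1, replace=False)
        for _ in range(n_steps):
            loc, cov = classical_estimate(sample[subset])
            # Keep the h points closest to the current estimate.
            new_subset = np.argsort(mahalanobis_sq(sample, loc, cov))[:h]
            if set(new_subset) == set(subset):
                break
            subset = new_subset
        loc, cov = classical_estimate(sample[subset])
        det = np.linalg.det(cov)
        if best is None or det < best[0]:
            best = (det, loc, cov)
    return best[1], best[2]


# Synthetic data (assumed): 200 valid records plus 20 erroneous records.
true_centre = np.array([10.0, 1000.0])
data_rng = np.random.default_rng(42)
valid = data_rng.multivariate_normal(
    true_centre, [[4.0, 30.0], [30.0, 900.0]], size=200)
errors = data_rng.uniform([30.0, 0.0], [40.0, 200.0], size=(20, 2))
sample = np.vstack([valid, errors])

classical_loc, classical_cov = classical_estimate(sample)
robust_loc, robust_cov = concentrated_estimate(sample, support_fraction=0.75)

# Distance of one candidate environment from each estimated niche centre.
candidate = np.array([[12.0, 1030.0]])
print("classical D^2:", mahalanobis_sq(candidate, classical_loc, classical_cov))
print("subset-based D^2:", mahalanobis_sq(candidate, robust_loc, robust_cov))
```

Lowering `support_fraction` discards more of the sample, which in this sketch makes the estimate less sensitive to erroneous records but, as the abstract notes, more sensitive to bias in the records that remain. Whatever value is used should be reported alongside the results.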
