Theoretical properties of distance distributions and novel metrics for nearest-neighbor feature selection.

Author information

Dawkins Bryan A, Le Trang T, McKinney Brett A

Affiliations

Genes and Human Disease, Oklahoma Medical Research Foundation, Oklahoma City, Oklahoma, United States of America.

Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania, Philadelphia, PA, United States of America.

Publication information

PLoS One. 2021 Feb 8;16(2):e0246761. doi: 10.1371/journal.pone.0246761. eCollection 2021.

DOI:10.1371/journal.pone.0246761

PMID:33556091

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC7870093/

Abstract

The performance of nearest-neighbor feature selection and prediction methods depends on the metric for computing neighborhoods and the distribution properties of the underlying data. Recent work to improve nearest-neighbor feature selection algorithms has focused on new neighborhood estimation methods and distance metrics. However, little attention has been given to the distributional properties of pairwise distances as a function of the metric or data type. Thus, we derive general analytical expressions for the mean and variance of pairwise distances for Lq metrics for normal and uniform random data with p attributes and m instances. The distribution moment formulas and detailed derivations provide a resource for understanding the distance properties for metrics and data types commonly used with nearest-neighbor methods, and the derivations provide the starting point for the following novel results. We use extreme value theory to derive the mean and variance for metrics that are normalized by the range of each attribute (difference of max and min). We derive analytical formulas for a new metric for genetic variants, which are categorical variables that occur in genome-wide association studies (GWAS). The genetic distance distributions account for minor allele frequency and the transition/transversion ratio. We introduce a new metric for resting-state functional MRI data (rs-fMRI) and derive its distance distribution properties. This metric is applicable to correlation-based predictors derived from time-series data. The analytical means and variances are in strong agreement with simulation results. We also use simulations to explore the sensitivity of the expected means and variances in the presence of correlation and interactions in the data. These analytical results and new metrics can be used to inform the optimization of nearest neighbor methods for a broad range of studies, including gene expression, GWAS, and fMRI data.
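The agreement between analytical moments and simulation described in the abstract can be illustrated for the simplest case. The sketch below is not taken from the paper: it assumes the Manhattan metric (q = 1) with independent attributes, and uses the standard identities E|X - Y| = 2/sqrt(pi), Var|X - Y| = 2 - 4/pi for standard normal X, Y and E|X - Y| = 1/3, Var|X - Y| = 1/18 for Uniform(0, 1) X, Y. All function names are hypothetical and chosen for illustration only.

```python
import math
import random
import statistics
from itertools import combinations


def l1_moments_normal(p):
    """Mean and variance of the Manhattan (L1) distance between two
    independent instances with p i.i.d. standard normal attributes."""
    # |X - Y| with X, Y ~ N(0, 1) is half-normal with scale sqrt(2).
    return 2.0 * p / math.sqrt(math.pi), p * (2.0 - 4.0 / math.pi)


def l1_moments_uniform(p):
    """Same quantities for p i.i.d. Uniform(0, 1) attributes."""
    # |X - Y| with X, Y ~ U(0, 1) has mean 1/3 and variance 1/18.
    return p / 3.0, p / 18.0


def pairwise_l1(data):
    """All m(m-1)/2 pairwise Manhattan distances between rows of data."""
    return [sum(abs(a - b) for a, b in zip(u, v))
            for u, v in combinations(data, 2)]


def simulate(draw, m, p, rng):
    """Draw an m x p data set with draw(rng) and return the sample
    mean and variance of its pairwise L1 distances."""
    data = [[draw(rng) for _ in range(p)] for _ in range(m)]
    d = pairwise_l1(data)
    return statistics.fmean(d), statistics.pvariance(d)


if __name__ == "__main__":
    rng = random.Random(0)
    m, p = 200, 50  # m instances, p attributes (illustrative sizes)
    cases = [
        ("normal", lambda r: r.gauss(0.0, 1.0), l1_moments_normal(p)),
        ("uniform", lambda r: r.random(), l1_moments_uniform(p)),
    ]
    for name, draw, (mu, var) in cases:
        sim_mu, sim_var = simulate(draw, m, p, rng)
        print(f"{name}: analytic mean={mu:.2f} var={var:.2f} | "
              f"simulated mean={sim_mu:.2f} var={sim_var:.2f}")
```

Pairs that share an instance are not independent, so the simulated variance only approximates the analytic value for a finite m; the match tightens as m grows.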

Figure (from the original article): https://cdn.ncbi.nlm.nih.gov/pmc/blobs/7bf5/7870093/3c42cb5c537f/pone.0246761.g001.jpg

Similar articles

Transition-transversion encoding and genetic relationship metric in ReliefF feature selection improves pathway enrichment in GWAS.

BioData Min. 2018 Nov 3;11:23. doi: 10.1186/s13040-018-0186-4. eCollection 2018.

Distance metric learning based on the class center and nearest neighbor relationship.

Neural Netw. 2023 Jul;164:631-644. doi: 10.1016/j.neunet.2023.05.004. Epub 2023 May 10.

STatistical Inference Relief (STIR) feature selection.

Bioinformatics. 2019 Apr 15;35(8):1358-1365. doi: 10.1093/bioinformatics/bty788.

Clustering gene expression data with a penalized graph-based metric.

BMC Bioinformatics. 2011 Jan 4;12:2. doi: 10.1186/1471-2105-12-2.

Severe limitations of the FEve metric of functional evenness and some alternative metrics.

Ecol Evol. 2020 Dec 21;11(1):123-132. doi: 10.1002/ece3.6974. eCollection 2021 Jan.

The evolution of Queensland spiny mountain crayfish of the genus Euastacus. I. Testing vicariance and dispersal with interspecific mitochondrial DNA.

Evolution. 2004 May;58(5):1073-85. doi: 10.1111/j.0014-3820.2004.tb00441.x.

Bias Reduction and Metric Learning for Nearest-Neighbor Estimation of Kullback-Leibler Divergence.

Neural Comput. 2018 Jul;30(7):1930-1960. doi: 10.1162/neco_a_01092. Epub 2018 Jun 14.

General Approach to Estimate Error Bars for Quantitative Structure-Activity Relationship Predictions of Molecular Activity.

J Chem Inf Model. 2018 Aug 27;58(8):1561-1575. doi: 10.1021/acs.jcim.8b00114. Epub 2018 Jul 17.

Nearest-neighbor Projected-Distance Regression (NPDR) for detecting network interactions with adjustments for multiple tests and confounding.

Bioinformatics. 2020 May 1;36(9):2770-2777. doi: 10.1093/bioinformatics/btaa024.

Cited by

Multivariate Optimization of k for k-Nearest-Neighbor Feature Selection With Dichotomous Outcomes: Complex Associations, Class Imbalance, and Application to RNA-Seq in Major Depressive Disorder.

IEEE Trans Comput Biol Bioinform. 2025 Jan-Feb;22(1):39-51. doi: 10.1109/TCBBIO.2024.3494599.

Centrality nearest-neighbor projected-distance regression (C-NPDR) feature selection for correlation-based predictors with application to resting-state fMRI study of major depressive disorder.

PLoS One. 2025 Mar 6;20(3):e0319346. doi: 10.1371/journal.pone.0319346. eCollection 2025.

Multi-Input data ASsembly for joint Analysis (MIASA): A framework for the joint analysis of disjoint sets of variables.

PLoS One. 2024 May 10;19(5):e0302425. doi: 10.1371/journal.pone.0302425. eCollection 2024.

Novel HLA associations with outcomes of Mycobacterium tuberculosis exposure and sarcoidosis in individuals of African ancestry using nearest-neighbor feature selection.

Genet Epidemiol. 2022 Oct;46(7):463-474. doi: 10.1002/gepi.22490. Epub 2022 Jun 14.
