Robust subspace methods for outlier detection in genomic data circumvents the curse of dimensionality.

Authors

Shetta Omar, Niranjan Mahesan

Affiliation

Electronics and Computer Science, University of Southampton, Southampton SO17 1BJ, UK.

Publication information

R Soc Open Sci. 2020 Feb 5;7(2):190714. doi: 10.1098/rsos.190714. eCollection 2020 Feb.

DOI:10.1098/rsos.190714

PMID:32257299

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC7062061/

Abstract

The application of machine learning to inference problems in biology is dominated by supervised learning problems of regression and classification, and unsupervised learning problems of clustering and variants of low-dimensional projections for visualization. A class of problems that have not gained much attention is detecting outliers in datasets, arising from reasons such as gross experimental, reporting or labelling errors. These could also be small parts of a dataset that are functionally distinct from the majority of a population. Outlier data are often identified by considering the probability density of normal data and comparing data likelihoods against some threshold. This classical approach suffers from the curse of dimensionality, which is a serious problem with omics data which are often found in very high dimensions. We develop an outlier detection method based on structured low-rank approximation methods. The objective function includes a regularizer based on neighbourhood information captured in the graph Laplacian. Results on publicly available genomic data show that our method robustly detects outliers whereas a density-based method fails even at moderate dimensions. Moreover, we show that our method has better clustering and visualization performance on the recovered low-dimensional projection when compared with popular dimensionality reduction techniques.
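The curse of dimensionality mentioned in the abstract can be illustrated with a small experiment. The sketch below is not the authors' method or data: it draws synthetic points uniformly from a unit cube and measures how the spread of distances from a query point shrinks relative to the nearest distance as the dimension grows. The function name `relative_contrast`, the sample size, and the uniform distribution are all assumptions chosen for illustration. When every point is roughly equally far from every other, distance-based and density-based scores separate outliers from normal samples less well, which is one common explanation for why such methods degrade in high dimensions.

```python
import math
import random


def relative_contrast(dim, n_points=200, seed=0):
    """Return (d_max - d_min) / d_min for the Euclidean distances from one
    query point to n_points other points, all drawn uniformly from [0, 1]^dim.

    Hypothetical helper for illustration only; synthetic data, not genomic data.
    """
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        dists.append(math.dist(query, point))
    return (max(dists) - min(dists)) / min(dists)


if __name__ == "__main__":
    # The contrast between the farthest and nearest point shrinks as dim grows.
    for dim in (2, 10, 100, 1000):
        print(f"dim={dim:5d}  relative contrast={relative_contrast(dim):.3f}")
```

Running the script prints one contrast value per dimension. Under these assumptions the value for 1000 dimensions is expected to be far smaller than the value for 2 dimensions, although the exact numbers depend on the seed and sample size.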

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0b97/7062061/2aaeaef60767/rsos190714-g1.jpg
