Anomaly detection in mixed high-dimensional molecular data.

Affiliations

Department of Statistical Bioinformatics, University of Regensburg, 93040 Regensburg, Germany.

Department of Hematology and Medical Oncology, University Medicine Göttingen, 37075 Göttingen, Germany.

Publication information

Bioinformatics. 2023 Aug 1;39(8). doi: 10.1093/bioinformatics/btad501.

DOI: 10.1093/bioinformatics/btad501

PMID: 37584673

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC10457663/

Abstract

MOTIVATION

Mixed molecular data combines continuous and categorical features of the same samples, such as OMICS profiles with genotypes, diagnoses, or patient sex. Like all high-dimensional molecular data, it is prone to incorrect values that can stem from various sources, for example technical limitations of the measurement devices, errors in sample preparation, or contamination. Most anomaly detection algorithms identify complete samples as outliers or anomalies. However, in most cases, not all measurements of those samples are erroneous; only a few one-dimensional features within the samples are incorrect. These one-dimensional data errors are continuous measurements that lie either outside or inside the normal ranges of their features but in both cases show atypical values given all other continuous and categorical features in the sample. Additionally, categorical anomalies can occur, for example when a genotype or diagnosis was submitted wrongly.

RESULTS

We introduce ADMIRE (Anomaly Detection using MIxed gRaphical modEls), a novel approach for the detection and correction of anomalies in mixed high-dimensional data. Here, we focus on the detection of single (one-dimensional) data errors in the categorical and continuous features of a sample. To this end, the joint distribution of continuous and categorical features is learned by mixed graphical models; anomalies are detected by the difference between measured values and model-based estimates and are corrected using imputation. We evaluated ADMIRE in simulation and by screening for anomalies in one of our own metabolic datasets. In simulation experiments, ADMIRE outperformed the state-of-the-art methods Local Outlier Factor, stray, and Isolation Forest.
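
The detect-then-impute idea described above can be illustrated with a minimal sketch. This is not the ADMIRE implementation and does not use the adadmire API. As a simplifying assumption, it replaces the mixed graphical model with one ridge regression per continuous feature (on the other continuous features plus a one-hot encoded category), handles continuous cells only, and runs on synthetic data. All names and settings (`conditional_estimates`, `detect_and_correct`, the penalty `lam`, the threshold of 6) are hypothetical choices made for illustration.

```python
import random
import statistics

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def conditional_estimates(X, D, lam=1.0):
    """Predict every continuous cell from all other features of its sample.

    Assumption: a ridge regression per feature stands in for the
    mixed graphical model; features are on comparable scales.
    """
    n, p = len(X), len(X[0])
    levels = sorted(set(D))
    onehot = [[1.0 if d == lev else 0.0 for lev in levels] for d in D]
    est = [[0.0] * p for _ in range(n)]
    for j in range(p):
        # Predictors: the other continuous features plus the one-hot category.
        Z = [[x[k] for k in range(p) if k != j] + oh for x, oh in zip(X, onehot)]
        q = len(Z[0])
        gram = [[sum(z[a] * z[b] for z in Z) + (lam if a == b else 0.0)
                 for b in range(q)] for a in range(q)]
        rhs = [sum(z[a] * x[j] for z, x in zip(Z, X)) for a in range(q)]
        w = solve(gram, rhs)
        for i, z in enumerate(Z):
            est[i][j] = sum(wa * za for wa, za in zip(w, z))
    return est

def detect_and_correct(X, D, threshold=6.0):
    """Flag cells with a large robust residual score and impute them."""
    est = conditional_estimates(X, D)
    n, p = len(X), len(X[0])
    flagged, corrected = [], [row[:] for row in X]
    for j in range(p):
        res = [X[i][j] - est[i][j] for i in range(n)]
        med = statistics.median(res)
        # 1.4826 * MAD estimates the standard deviation for Gaussian residuals.
        scale = 1.4826 * statistics.median(abs(r - med) for r in res)
        for i in range(n):
            if abs(res[i] - med) / scale > threshold:
                flagged.append((i, j))
                corrected[i][j] = est[i][j]  # impute the model-based estimate
    return flagged, corrected

# Synthetic mixed data: 8 continuous features share a latent factor t,
# and the categorical label shifts all of them by a fixed offset.
rng = random.Random(0)
n, p = 200, 8
D = [rng.choice("AB") for _ in range(n)]
offset = {"A": 0.0, "B": 2.0}
X = []
for d in D:
    t = rng.gauss(0.0, 3.0)
    X.append([offset[d] + t + rng.gauss(0.0, 0.3) for _ in range(p)])

# Corrupt one cell by shifting it toward the feature mean.
i0, j0 = 17, 2
truth = X[i0][j0]
col_mean = statistics.fmean(row[j0] for row in X)
X[i0][j0] += -5.0 if truth > col_mean else 5.0

flagged, X_corr = detect_and_correct(X, D)
print(flagged)
```

In this toy setting the corrupted value is shifted toward the feature mean, so it stays inside the feature's marginal range and is atypical only given the sample's other features, which is the situation the abstract describes. Detecting categorical anomalies would need an analogous classifier per categorical feature, which is omitted here.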

AVAILABILITY AND IMPLEMENTATION

All data and code are available at https://github.com/spang-lab/adadmire. ADMIRE is implemented in a Python package called adadmire, which can be found at https://pypi.org/project/adadmire.
