Anomaly detection in mixed high-dimensional molecular data.

Affiliations

Department of Statistical Bioinformatics, University of Regensburg, 93040 Regensburg, Germany.

Department of Hematology and Medical Oncology, University Medicine Göttingen, 37075 Göttingen, Germany.

Publication information

Bioinformatics. 2023 Aug 1;39(8). doi: 10.1093/bioinformatics/btad501.

Abstract

MOTIVATION

Mixed molecular data combines continuous and categorical features of the same samples, such as OMICS profiles with genotypes, diagnoses, or patient sex. Like all high-dimensional molecular data, it is prone to incorrect values that can stem from various sources, for example the technical limitations of the measurement devices, errors in sample preparation, or contamination. Most anomaly detection algorithms identify complete samples as outliers or anomalies. However, in most cases, not all measurements of those samples are erroneous; only a few one-dimensional features within the samples are incorrect. These one-dimensional data errors are continuous measurements that lie either outside or inside the normal ranges of their features but, in both cases, show atypical values given all other continuous and categorical features in the sample. Additionally, categorical anomalies can occur, for example when a genotype or diagnosis was submitted incorrectly.

RESULTS

We introduce ADMIRE (Anomaly Detection using MIxed gRaphical modEls), a novel approach for the detection and correction of anomalies in mixed high-dimensional data. Here, we focus on the detection of single (one-dimensional) data errors in the categorical and continuous features of a sample. To this end, the joint distribution of continuous and categorical features is learned by mixed graphical models; anomalies are detected by the difference between measured values and model-based estimates and are corrected using imputation. We evaluated ADMIRE in simulation and by screening for anomalies in one of our own metabolic datasets. In simulation experiments, ADMIRE outperformed the state-of-the-art methods Local Outlier Factor, stray, and Isolation Forest.
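The detect-then-impute idea for continuous features can be illustrated with a minimal sketch. This is not the ADMIRE implementation and does not use the adadmire API: as a simplified stand-in for the conditional means of a fitted mixed graphical model, it predicts each continuous feature from all other features by ridge regression. The function names (conditional_estimates, detect_and_correct), the ridge penalty, the score threshold, and the simulated data are all assumptions made for illustration.

```python
import numpy as np


def conditional_estimates(X, D, ridge=1e-2):
    """Predict each continuous column of X from all other columns of X and D.

    X: (n, p) continuous features. D: (n, q) one-hot encoded categorical features.
    Ridge regression is an assumed stand-in for model-based estimates;
    it is NOT the mixed graphical model used by ADMIRE.
    """
    n, p = X.shape
    X_hat = np.empty_like(X, dtype=float)
    for j in range(p):
        others = np.delete(X, j, axis=1)
        A = np.hstack([np.ones((n, 1)), others, D])
        # Closed-form ridge solution: (A^T A + ridge * I)^-1 A^T y
        gram = A.T @ A + ridge * np.eye(A.shape[1])
        beta = np.linalg.solve(gram, A.T @ X[:, j])
        X_hat[:, j] = A @ beta
    return X_hat


def detect_and_correct(X, D, threshold=5.0):
    """Score each cell by its residual, flag extreme cells, impute the estimate."""
    X_hat = conditional_estimates(X, D)
    resid = X - X_hat
    # Robust per-feature scale (median absolute deviation), so that the
    # anomalies themselves do not inflate the scale estimate.
    med = np.median(resid, axis=0)
    mad = 1.4826 * np.median(np.abs(resid - med), axis=0)
    scores = np.abs(resid - med) / mad
    flags = scores > threshold  # threshold is an assumed value
    X_corrected = np.where(flags, X_hat, X)
    return flags, scores, X_corrected


# Simulated data (hypothetical): four correlated continuous features and one
# binary categorical feature that shifts the shared latent signal.
rng = np.random.default_rng(0)
n = 300
sex = rng.integers(0, 2, size=n)
D = np.eye(2)[sex]  # one-hot encoding of the categorical feature
latent = rng.normal(size=n) + 2.0 * sex
X = np.column_stack([latent + 0.1 * rng.normal(size=n) for _ in range(4)])
X_true = X.copy()

# Corrupt one cell. The shift is large relative to the 0.1 noise level but
# not relative to the overall spread of the feature.
i, j = 10, 2
X[i, j] = X_true[i, j] - 2.0

flags, scores, X_corr = detect_and_correct(X, D)
print("corrupted cell flagged:", bool(flags[i, j]))
print("error before correction:", abs(X[i, j] - X_true[i, j]))
print("error after correction:", abs(X_corr[i, j] - X_true[i, j]))
```

Because the corrupted cell is also used as a predictor for the other features of the same sample, it can inflate their residuals and produce additional flags in that sample. The sketch does not address this effect, and it covers only continuous features, not categorical anomalies.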

AVAILABILITY AND IMPLEMENTATION

All data and code are available at https://github.com/spang-lab/adadmire. ADMIRE is implemented in a Python package called adadmire, which can be found at https://pypi.org/project/adadmire.
