Laboratory of Immune System Biology, National Institute of Allergy and Infectious Diseases, National Institutes of Health, Bethesda, Maryland, USA.
J Comput Biol. 2023 Jun;30(6):726-735. doi: 10.1089/cmb.2022.0243. Epub 2023 Apr 12.
Detection of omics sample outliers is important for preventing erroneous biological conclusions, developing robust experimental protocols, and discovering rare biological states. Two recent publications describe robust algorithms for detecting transcriptomic sample outliers, but neither algorithm had been incorporated into a software tool for scientists. Here we describe Ensemble Methods for Outlier Detection (EnsMOD), which incorporates both algorithms. EnsMOD calculates how closely the quantitation variation follows a normal distribution, plots the density curve of each sample to visualize anomalies, performs hierarchical cluster analyses to calculate how closely the samples cluster with each other, and performs robust principal component analyses to statistically test whether any sample is an outlier. The probabilistic threshold parameters can be easily adjusted to tighten or loosen the outlier detection stringency. EnsMOD can be used to analyze any omics dataset with normally distributed variance. Here it was used to analyze a simulated proteomics dataset, a multiomic (proteome and transcriptome) dataset, a single-cell proteomics dataset, and a phosphoproteomics dataset. EnsMOD successfully identified all of the simulated outliers, and subsequent removal of a detected outlier improved data quality for downstream statistical analyses.
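The ensemble idea in the abstract, several detectors voting on each sample under an adjustable stringency threshold, can be illustrated with a minimal Python sketch. This is not EnsMOD's implementation. As simplifying assumptions, the hierarchical clustering step is replaced by each sample's mean Pearson correlation with the other samples, and the robust principal component analysis is replaced by the distance from the feature-wise median profile, both scored with median/MAD robust z-scores. All function names, the `cutoff` and `min_votes` parameters, and the simulated data are hypothetical.

```python
import math
import random
import statistics


def robust_z(values):
    """Robust z-scores based on the median and the scaled MAD."""
    values = list(values)
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    scale = 1.4826 * mad  # makes the MAD comparable to a standard deviation
    if scale == 0:
        return [0.0] * len(values)
    return [(v - med) / scale for v in values]


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant profiles."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def mean_correlation_votes(samples, cutoff):
    """Stand-in for the clustering step: flag unusually low mean correlation."""
    n = len(samples)
    mean_r = [
        statistics.fmean([pearson(samples[i], samples[j]) for j in range(n) if j != i])
        for i in range(n)
    ]
    return [z < -cutoff for z in robust_z(mean_r)]


def median_distance_votes(samples, cutoff):
    """Stand-in for the robust PCA step: flag samples far from the median profile."""
    n_features = len(samples[0])
    center = [statistics.median([s[k] for s in samples]) for k in range(n_features)]
    distances = [math.dist(s, center) for s in samples]
    return [z > cutoff for z in robust_z(distances)]


def ensemble_outliers(samples, cutoff=3.0, min_votes=2):
    """Return indices of samples flagged by at least `min_votes` detectors."""
    votes = [
        mean_correlation_votes(samples, cutoff),
        median_distance_votes(samples, cutoff),
    ]
    return [i for i in range(len(samples)) if sum(v[i] for v in votes) >= min_votes]


def simulate(n_samples=10, n_features=200, outlier_index=7, seed=0):
    """Hypothetical log-intensity data: shared baseline plus per-sample noise."""
    rng = random.Random(seed)
    baseline = [rng.gauss(20.0, 2.0) for _ in range(n_features)]
    samples = []
    for i in range(n_samples):
        noise_sd = 3.0 if i == outlier_index else 0.3  # one corrupted sample
        samples.append([b + rng.gauss(0.0, noise_sd) for b in baseline])
    return samples


samples = simulate()
print(ensemble_outliers(samples, cutoff=3.0, min_votes=2))
```

Lowering `cutoff` or `min_votes` loosens the detection stringency, and raising them tightens it, which mirrors the adjustable thresholds mentioned in the abstract. The robust z-scores assume approximately normally distributed summary values, in line with the abstract's stated requirement of normally distributed variance.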