Evaluation of imputation and imputation-free strategies for differential abundance analysis in metaproteomics data.

Authors

Mou Xinyi, Du Haoyu, Qiao Guanghua, Li Jing

Affiliation

Department of Bioinformatics and Biostatistics, School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, 800 Dongchuan Road, Shanghai 200240, China.

Publication

Brief Bioinform. 2025 Mar 4;26(2). doi: 10.1093/bib/bbaf141.

Abstract

Metaproteomics data are derived from the collective protein composition of dynamic multi-organism systems, and both the proportion of missing values and the dimensionality of the data exceed those observed in single-organism experiments. Consequently, evaluations of differential analysis strategies in other mass spectrometry (MS) data (such as proteomics and metabolomics) may not be directly applicable to metaproteomics data. In this study, we systematically evaluated five imputation methods [sample minimum, quantile regression, k-nearest neighbors (KNN), Bayesian principal component analysis (bPCA), random forest (RF)] and six imputation-free methods (moderated t-test, two-part t-test, two-part Wilcoxon test, semiparametric differential abundance analysis, differential abundance analysis with Bayes shrinkage estimation of variance, and Mixture) for differential analysis in simulated metaproteomic datasets based on both data-dependent acquisition MS experiments and emerging data-independent acquisition experiments. The simulation datasets comprised 588 scenarios covering the effects of sample size, fold change between case and control, and the ratio of values missing at random and not at random. Compared to imputation-free methods, KNN, bPCA, and RF imputation performed poorly in datasets with a high missingness ratio and large sample size, and carried a high risk of false positives. We made empirical recommendations based on the balance between sensitivity and control of false positives. The moderated t-test was optimal in scenarios with a large sample size and a low missingness ratio. The two-part Wilcoxon test was recommended in scenarios with a small sample size and a low missingness ratio, or a large sample size and a high missingness ratio. The comprehensive evaluations in our study can provide guidance for differential abundance analysis in metaproteomics.
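To make the idea of an imputation-free test concrete, the following is a minimal sketch of a two-part Wilcoxon test for a single protein, using only the Python standard library. It is not the authors' implementation, which the abstract does not describe. It assumes the common textbook formulation: a two-proportion z-test on the fraction of observed (non-missing) values is combined with a Wilcoxon rank-sum z statistic on the observed values, and the sum of squared z statistics is referred to a chi-square distribution with one degree of freedom per usable part. The function names, the use of `None` for missing values, the normal approximation, and the omission of a tie correction are all assumptions made for illustration.

```python
import math


def rank_sum_z(x, y):
    """Wilcoxon rank-sum z statistic for group x versus group y.

    Uses midranks for ties and the normal approximation without a
    tie correction to the variance (a simplifying assumption).
    """
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        j = i
        # Extend j over the run of values tied with pooled[i].
        while j + 1 < len(pooled) and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        midrank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[k] = midrank
        i = j + 1
    w = sum(r for r, (_, group) in zip(ranks, pooled) if group == 0)
    n1, n2 = len(x), len(y)
    mean = n1 * (n1 + n2 + 1) / 2
    var = n1 * n2 * (n1 + n2 + 1) / 12
    return (w - mean) / math.sqrt(var)


def two_part_wilcoxon(case, control):
    """Return (statistic, degrees_of_freedom, p_value) for one protein.

    `case` and `control` are lists of abundances with None marking a
    missing value (hypothetical input convention).
    """
    obs_case = [v for v in case if v is not None]
    obs_ctrl = [v for v in control if v is not None]
    n1, n2 = len(case), len(control)
    stat, df = 0.0, 0

    # Part 1: do the two groups differ in the proportion of observed values?
    p_pool = (len(obs_case) + len(obs_ctrl)) / (n1 + n2)
    if 0 < p_pool < 1:
        se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
        z = (len(obs_case) / n1 - len(obs_ctrl) / n2) / se
        stat += z * z
        df += 1

    # Part 2: do the observed abundances differ in location?
    if obs_case and obs_ctrl:
        z = rank_sum_z(obs_case, obs_ctrl)
        stat += z * z
        df += 1

    if df == 0:
        return stat, df, 1.0
    if df == 2:
        p = math.exp(-stat / 2)             # chi-square survival, 2 df
    else:
        p = math.erfc(math.sqrt(stat / 2))  # chi-square survival, 1 df
    return stat, df, p


stat, df, p = two_part_wilcoxon(
    [10, 11, 12, 13, 14, 15],
    [None, None, None, None, 1, 2],
)
print(stat, df, p)
```

In this example both parts contribute: the missingness part gives z² = 6 and the rank-sum part gives z² = 4, so the combined statistic is 10 on 2 degrees of freedom. When a protein has no missing values in either group, only the rank-sum part is used and the test reduces to an ordinary Wilcoxon rank-sum test.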

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/834d/12009712/0344fdf0d80d/bbaf141f1.jpg
