A Statistical Approach for Identifying the Best Combination of Normalization and Imputation Methods for Label-Free Proteomics Expression Data.

Author information

Sakthivel Kabilan, Lal Shashi Bhushan, Srivastava Sudhir, Chaturvedi Krishna Kumar, Khan Yasin Jeshima, Mishra Dwijesh Chandra, Madival Sharanbasappa D, Vaidhyanathan Ramasubramanian, Jha Girish Kumar

Affiliations

The Graduate School, ICAR-Indian Agricultural Research Institute, New Delhi 110012, India.

Division of Agricultural Bioinformatics, ICAR-Indian Agricultural Statistics Research Institute, New Delhi 110012, India.

Publication information

J Proteome Res. 2025 Jan 3;24(1):158-170. doi: 10.1021/acs.jproteome.4c00552. Epub 2024 Dec 10.

Abstract

Label-free proteomics expression data sets often exhibit data heterogeneity and missing values, necessitating effective normalization and imputation methods. The choice of normalization and imputation methods is inherently data-specific, and selecting the optimal approach from the available options is critical for robust downstream analysis. This study aimed to identify the most suitable combination of these methods for quality control and accurate identification of differentially expressed proteins. We developed nine combinations by integrating three normalization methods (locally weighted linear regression (LOESS), variance stabilization normalization (VSN), and robust linear regression (RLR)) with three imputation methods (k-nearest neighbors (k-NN), local least-squares (LLS), and singular value decomposition (SVD)). We used three statistical measures, the pooled coefficient of variation (PCV), pooled estimate of variance (PEV), and pooled median absolute deviation (PMAD), to assess intragroup and intergroup variation. For each statistical measure, the combination yielding the lowest value was chosen as a suitable normalization and imputation method for the data set. The performance of this approach was tested using two spiked-in standard label-free proteomics benchmark data sets. The identified combinations returned a low normalized root-mean-square error (NRMSE) and showed better performance in identifying spiked-in proteins. The developed approach can be accessed through the R package 'lfproQC' and a user-friendly Shiny web application (https://dabiniasri.shinyapps.io/lfproQC and http://omics.icar.gov.in/lfproQC), making it a valuable resource for researchers looking to apply this method to their data sets.
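The selection step described in the abstract can be illustrated with a minimal Python sketch. The abstract does not give formulas for PCV, PEV, or PMAD, so the definitions below are assumptions: each measure is computed per protein within each sample group, averaged over proteins, and then averaged over groups, which captures only the intragroup part of the assessment. The function names `pooled_measures` and `pick_best` are hypothetical and are not part of the 'lfproQC' package, and the matrices are toy numbers rather than real proteomics data.

```python
import numpy as np

def pooled_measures(data, groups):
    """Assumed pooled variation measures for a proteins x samples matrix.

    data   : complete matrix (e.g. after normalization and imputation),
             with nonzero per-protein group means
    groups : one group label per sample column
    """
    data = np.asarray(data, dtype=float)
    groups = np.asarray(groups)
    cv, var, mad = [], [], []
    for g in np.unique(groups):
        block = data[:, groups == g]            # samples of one group
        mean = block.mean(axis=1)
        sd = block.std(axis=1, ddof=1)          # sample standard deviation
        cv.append(np.mean(sd / mean))           # mean per-protein CV
        var.append(np.mean(sd ** 2))            # mean per-protein variance
        med = np.median(block, axis=1, keepdims=True)
        mad.append(np.mean(np.median(np.abs(block - med), axis=1)))
    return {"PCV": float(np.mean(cv)),
            "PEV": float(np.mean(var)),
            "PMAD": float(np.mean(mad))}

def pick_best(results):
    """Return, for each measure, the combination with the lowest value."""
    return {m: min(results, key=lambda name: results[name][m])
            for m in ("PCV", "PEV", "PMAD")}

# Toy example: two hypothetical processed versions of the same data set.
groups = ["A", "A", "A", "B", "B", "B"]
combo_1 = [[10.0, 10.2, 9.9, 12.0, 12.1, 11.8],
           [8.0, 8.1, 7.9, 8.5, 8.4, 8.6]]
combo_2 = [[10.0, 11.0, 9.0, 12.0, 13.0, 11.0],
           [8.0, 8.8, 7.2, 8.5, 9.3, 7.7]]
results = {"combo_1": pooled_measures(combo_1, groups),
           "combo_2": pooled_measures(combo_2, groups)}
print(pick_best(results))
```

In practice, each of the nine normalization and imputation combinations would contribute one entry to `results`, and the three measures need not agree on a single winner.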

