Bioinformatics Division, WEHI, Melbourne, Australia; Department of Medical Biology, University of Melbourne, Melbourne, Australia; Colonial Foundation Healthy Ageing Centre, WEHI, Melbourne, Australia.
Department of Medical Biology, University of Melbourne, Melbourne, Australia; Colonial Foundation Healthy Ageing Centre, WEHI, Melbourne, Australia; Advanced Technology and Biology Division, WEHI, Melbourne, Australia.
Mol Cell Proteomics. 2023 Aug;22(8):100558. doi: 10.1016/j.mcpro.2023.100558. Epub 2023 Apr 25.
Mass spectrometry (MS) enables high-throughput identification and quantification of proteins in complex biological samples and can provide insights into the global function of biological systems. Label-free quantification is cost-effective and suitable for the analysis of human samples. Despite rapid developments in label-free data acquisition workflows, the number of proteins quantified across samples can be limited by technical and biological variability. This variation can result in missing values, which in turn can complicate downstream data analysis. General-purpose or gene expression-specific imputation algorithms are widely used to improve data completeness. Here, we propose an imputation algorithm designed for label-free MS data that is aware of the type of missingness affecting the data. On published datasets acquired by data-dependent and data-independent acquisition workflows with varying degrees of biological complexity, we demonstrate that the proposed missing value estimation procedure, based on barycenter computation, competes closely with state-of-the-art imputation algorithms in differential abundance tasks while outperforming them in the accuracy of variance estimates of peptide abundance measurements, and better controls the false discovery rate in label-free MS experiments. The barycenter estimation procedure is implemented in the msImpute software package, available from the Bioconductor repository.