Fu Yingxue, Yuan Zuo-Fei, Wu Long, Peng Junmin, Wang Xusheng, High Anthony A
Center for Proteomics and Metabolomics, St. Jude Children's Research Hospital, Memphis, Tennessee, USA.
Department of Structural Biology, St. Jude Children's Research Hospital, Memphis, Tennessee, USA.
Proteomics. 2025 Jan;25(1-2):e202400271. doi: 10.1002/pmic.202400271. Epub 2024 Dec 10.
Advances in high-throughput omics technologies have enabled system-wide characterization of biological samples across multiple molecular levels, such as the genome, transcriptome, and proteome. However, as sample sizes rapidly increase in large-scale multi-omics studies, sample mix-ups have become a prevalent issue, compromising data integrity and leading to erroneous conclusions. The interconnected nature of multi-omics data presents an opportunity to identify and correct these errors. This review examines the potential sources of sample mix-ups and evaluates the methodologies and tools developed for detecting and correcting these errors, with an emphasis on approaches applicable to proteomics data. We categorize existing tools into three main groups: expression/protein quantitative trait loci-based, genotype concordance-based, and gene/protein expression correlation-based approaches. Notably, only a handful of tools currently utilize the proteogenomics approach for correcting sample mix-ups at the proteomics level. Integrating the strengths of current tools across diverse data types could enable the development of more versatile and comprehensive solutions. In conclusion, verifying sample identity is a critical first step to reduce bias and increase precision in subsequent analyses for large-scale multi-omics studies. By leveraging these tools for identifying and correcting sample mix-ups, researchers can significantly improve the reliability and reproducibility of biomedical research.
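As an illustration of the third category (gene/protein expression correlation-based approaches), the following is a minimal sketch and not the method of any specific tool covered in the review. It assumes two matrices measured on nominally the same samples, one of transcript abundances and one of protein abundances, with matching gene order. The function name `find_mismatched_pairs`, the simulated data, and the noise level are all hypothetical choices made for the example. Each RNA sample is correlated against every protein sample, and a sample is flagged when its best-correlated partner is not the one carrying the same label.

```python
import numpy as np


def find_mismatched_pairs(rna, protein):
    """Flag samples whose best RNA-protein match is not their labeled partner.

    rna, protein: arrays of shape (n_genes, n_samples) with identical gene
    order; column i in both matrices is *labeled* as the same sample.
    Returns a dict {rna_column: best_matching_protein_column} for mismatches.
    """
    def zscore_genes(x):
        # Standardize each gene across samples so that highly abundant genes
        # do not dominate the sample-to-sample correlation.
        return (x - x.mean(axis=1, keepdims=True)) / x.std(axis=1, keepdims=True)

    zr, zp = zscore_genes(rna), zscore_genes(protein)
    n = rna.shape[1]

    # np.corrcoef treats rows as variables, so pass samples as rows and keep
    # the RNA-vs-protein block of the stacked correlation matrix.
    corr = np.corrcoef(zr.T, zp.T)[:n, n:]

    best = corr.argmax(axis=1)
    return {i: int(j) for i, j in enumerate(best) if i != j}


# Simulated (hypothetical) data: a shared per-gene, per-sample signal plus
# independent measurement noise in each omics layer.
rng = np.random.default_rng(0)
n_genes, n_samples = 200, 8
signal = rng.normal(size=(n_genes, n_samples))
rna = signal + 0.3 * rng.normal(size=(n_genes, n_samples))
protein = signal + 0.3 * rng.normal(size=(n_genes, n_samples))

# Introduce a deliberate mix-up: swap the protein profiles of samples 2 and 5.
protein[:, [2, 5]] = protein[:, [5, 2]]

flagged = find_mismatched_pairs(rna, protein)
print(flagged)
```

In real data, RNA-protein correlation is considerably weaker than in this simulation, so a practical implementation would likely restrict the comparison to genes with strong cis RNA-protein correlation and require a clear margin between the best and second-best match before relabeling a sample.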