Brombacher Eva, Schilling Oliver, Kreutz Clemens
Institute of Medical Biometry and Statistics, Faculty of Medicine and Medical Center-University of Freiburg, Freiburg, Germany.
Centre for Integrative Biological Signaling Studies (CIBSS), University of Freiburg, Freiburg, Germany.
Sci Rep. 2025 Jan 25;15(1):3189. doi: 10.1038/s41598-025-87256-5.
The characteristics of data produced by omics technologies are pivotal, as they critically influence the feasibility and effectiveness of computational methods applied in downstream analyses, such as data harmonization and differential abundance analyses. Furthermore, these data characteristics vary across datasets, which leads to diverging outcomes in benchmarking studies, even though such studies are essential for guiding the selection of appropriate analysis methods in all omics fields. Additionally, downstream analysis tools are often developed and applied within specific omics communities because the data characteristics are presumed to differ between omics technologies. In this study, we investigate over ten thousand datasets to understand how proteomics, metabolomics, lipidomics, transcriptomics, and microbiome data vary in specific data characteristics. We show patterns of data characteristics specific to the investigated omics types and provide a tool that enables researchers to assess how representative a given omics dataset is of its respective discipline. Moreover, we illustrate how data characteristics can impact analyses, using the example of normalization in the presence of sample-dependent proportions of missing values. Given the variability of omics data characteristics, we encourage the systematic inspection of these characteristics in benchmark studies and for downstream analyses to prevent suboptimal method selection and unintended bias.
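The effect of sample-dependent proportions of missing values on normalization can be made concrete with a toy simulation. The sketch below is not the authors' analysis or tool. It assumes, for illustration only, that missingness is intensity-dependent (the lowest values in a sample are the ones that go missing) and that normalization is done by subtracting each sample's median; the function name `simulate_observed_medians` and all parameter values are hypothetical.

```python
import random
import statistics

def simulate_observed_medians(n_features=2000, missing_fractions=(0.0, 0.2, 0.5), seed=0):
    """Return the median of observed values per sample after left-censoring.

    All samples share the same true log-intensity distribution, so any
    difference between the observed medians is caused by missingness alone.
    """
    rng = random.Random(seed)
    medians = []
    for frac in missing_fractions:
        # True log-intensities: same distribution for every sample (toy assumption).
        values = sorted(rng.gauss(20.0, 2.0) for _ in range(n_features))
        # Assumed intensity-dependent missingness: drop the lowest `frac` share of values.
        n_missing = int(frac * n_features)
        observed = values[n_missing:]
        medians.append(statistics.median(observed))
    return medians

fractions = (0.0, 0.2, 0.5)
medians = simulate_observed_medians(missing_fractions=fractions)
# Median normalization would subtract each sample's observed median, so a sample
# with more missing values would be shifted downward relative to the others.
offsets = [m - medians[0] for m in medians]
for frac, m, off in zip(fractions, medians, offsets):
    print(f"missing={frac:.0%}  observed median={m:.2f}  artificial shift={off:+.2f}")
```

Under these assumptions, the observed median rises with the proportion of missing values even though the underlying distributions are identical, so median-centering would introduce a sample-specific bias rather than remove one. Inspecting the per-sample missing-value proportion before choosing a normalization method is one way to catch this.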