Trials and tribulations of 'omics data analysis: assessing quality of SIMCA-based multivariate models using examples from pulmonary medicine.

Authors

Wheelock Åsa M, Wheelock Craig E

Affiliation

Respiratory Medicine Unit, Department of Medicine, and Center for Molecular Medicine, Karolinska Institutet, Stockholm, Sweden.

Publication

Mol Biosyst. 2013 Nov;9(11):2589-96. doi: 10.1039/c3mb70194h.

DOI:10.1039/c3mb70194h

PMID:23999822

Abstract

Respiratory diseases are multifactorial heterogeneous diseases that have proved recalcitrant to understanding using focused molecular techniques. This trend has led to the rise of 'omics approaches (e.g., transcriptomics, proteomics) and the subsequent acquisition of large-scale datasets consisting of multiple variables. In 'omics technology-based investigations, discrepancies between the number of variables analyzed (e.g., mRNA, proteins, metabolites) and the number of study subjects constitute a major statistical challenge. The application of traditional univariate statistical methods (e.g., t-test) to these "short-and-wide" datasets may result in high numbers of false positives, while the predominant approach of p-value correction to account for these high false positive rates (e.g., FDR, Bonferroni) is associated with significant losses in statistical power. In other words, the benefit of fewer false positives must be weighed against a concomitant loss in true positives. As an alternative, multivariate statistical analysis (MVA) is increasingly being employed to cope with 'omics-based data structures. When properly applied, MVA approaches can be powerful tools for integration and interpretation of complex 'omics-based datasets towards the goal of identifying biomarkers and/or subphenotypes. However, MVA methods are also prone to over-interpretation and misuse. A software package commonly used in biomedical research to perform MVA-based analyses is SIMCA, which includes multiple MVA methods. In this opinion piece, we propose guidelines for minimum reporting standards for a SIMCA-based workflow, in terms of data preprocessing (e.g., normalization, scaling) and model statistics (number of components, R2, Q2, and CV-ANOVA p-value). Examples of these applications in recent COPD and asthma studies are provided. It is expected that readers will gain an increased understanding of the power and utility of MVA methods for applications in biomedical research.
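
The trade-off between uncorrected univariate testing and p-value correction described in the abstract can be illustrated with a minimal sketch. The p-values below are hypothetical numbers chosen for illustration and are not taken from the paper. The functions `bonferroni` and `benjamini_hochberg` are plain-Python helpers written for this example, with Benjamini-Hochberg standing in as one common FDR procedure.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject a hypothesis only if its p-value is <= alpha / number of tests."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def benjamini_hochberg(p_values, alpha=0.05):
    """Benjamini-Hochberg step-up procedure controlling the false discovery rate."""
    m = len(p_values)
    # Indices of the p-values, ordered from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k whose p-value satisfies p_(k) <= (k / m) * alpha.
    max_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            max_k = rank
    # Reject the hypotheses with the max_k smallest p-values.
    rejected = [False] * m
    for idx in order[:max_k]:
        rejected[idx] = True
    return rejected


# Hypothetical p-values for 10 measured variables (illustration only).
p_values = [0.001, 0.004, 0.012, 0.019, 0.03, 0.2, 0.35, 0.5, 0.7, 0.9]

uncorrected = [p <= 0.05 for p in p_values]

print("uncorrected:", sum(uncorrected))                      # 5 variables flagged
print("Bonferroni: ", sum(bonferroni(p_values)))             # 2 variables flagged
print("BH (FDR):   ", sum(benjamini_hochberg(p_values)))     # 4 variables flagged
```

With these made-up numbers, correction reduces the count of variables called significant from 5 to 4 (Benjamini-Hochberg) or 2 (Bonferroni). Whether the dropped variables were false or true positives cannot be known from the p-values alone, which is the loss of power the abstract refers to.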
