Forsgren Edvin, Björkblom Benny, Trygg Johan, Jonsson Pär
Computational Life Science Cluster (CLiC), Department of Chemistry, Umeå University, SE-901 87 Umeå, Sweden.
Department of Chemistry, Umeå University, SE-901 87 Umeå, Sweden.
J Chem Inf Model. 2025 Feb 24;65(4):1762-1770. doi: 10.1021/acs.jcim.4c01799. Epub 2025 Feb 3.
Multiclass data sets and large-scale studies are increasingly common in omics sciences, drug discovery, and clinical research due to advancements in analytical platforms. Efficiently handling these data sets and discerning subtle differences across multiple classes remains a significant challenge. In metabolomics, two-class orthogonal projection to latent structures discriminant analysis (OPLS-DA) models are widely used due to their strong discrimination capabilities and ability to provide interpretable information on class differences. However, these models face challenges in multiclass settings. A common solution is to transform the multiclass comparison into multiple two-class comparisons, which, while more effective than a global multiclass OPLS-DA model, unfortunately results in a manual, time-consuming model-building process with complicated interpretation. Here, we introduce an extension of OPLS-DA for data-driven multiclass classification: orthogonal partial least squares-hierarchical discriminant analysis (OPLS-HDA). OPLS-HDA integrates hierarchical cluster analysis (HCA) with the OPLS-DA framework to create a decision tree, addressing multiclass classification challenges and providing intuitive visualization of interclass relationships. To avoid overfitting and ensure reliable predictions, we use cross-validation during model building. Benchmark results show that OPLS-HDA performs competitively across diverse data sets compared to eight established methods. This method represents a significant advancement, offering a powerful tool to dissect complex multiclass data sets. With its versatility, interpretability, and ease of use, OPLS-HDA is an efficient approach to multiclass data analysis applicable across various fields.
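The idea described above, a class hierarchy derived by clustering with a two-class model at each split, can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes a nearest-centroid rule as a stand-in for each two-class OPLS-DA model, Euclidean distance between class centroids as the clustering criterion, and no cross-validation. The names `centroid`, `build_tree`, and `predict` and the toy data are hypothetical.

```python
from math import dist
from statistics import fmean


def centroid(rows):
    """Column-wise mean of a list of equal-length feature vectors."""
    return [fmean(col) for col in zip(*rows)]


def build_tree(X, y):
    """Merge the two closest class groups repeatedly until one tree remains.

    Leaves are class labels. Each internal node is a dict holding its two
    children and the centroid of the samples under each child.
    ASSUMPTION: centroid distance replaces the paper's HCA settings, and the
    stored centroids replace a fitted two-class OPLS-DA model.
    """
    groups = {}
    for row, label in zip(X, y):
        groups.setdefault(label, []).append(row)
    clusters = [(label, rows) for label, rows in groups.items()]
    while len(clusters) > 1:
        pairs = [(i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        # Pick the pair of clusters whose centroids are closest.
        i, j = min(pairs, key=lambda p: dist(centroid(clusters[p[0]][1]),
                                             centroid(clusters[p[1]][1])))
        (left, rows_l), (right, rows_r) = clusters[i], clusters[j]
        node = {"left": left, "right": right,
                "c_left": centroid(rows_l), "c_right": centroid(rows_r)}
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append((node, rows_l + rows_r))
    return clusters[0][0]


def predict(tree, x):
    """Walk from the root to a leaf, taking the nearer branch at each node."""
    node = tree
    while isinstance(node, dict):
        nearer_left = dist(x, node["c_left"]) <= dist(x, node["c_right"])
        node = node["left"] if nearer_left else node["right"]
    return node


# Toy data (hypothetical): classes A and B lie close together, C is far away.
X = [[0.0, 0.0], [0.2, 0.1], [1.0, 0.0], [1.2, 0.1], [10.0, 10.0], [10.2, 9.9]]
y = ["A", "A", "B", "B", "C", "C"]
tree = build_tree(X, y)
label = predict(tree, [1.1, 0.0])
```

In this toy case A and B are merged first, so the root separates C from the A/B group and a second node separates A from B. A tree of this shape is what makes interclass relationships readable. In a real analysis each node would hold a fitted and cross-validated two-class model rather than two centroids.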