A comprehensive benchmarking of machine learning algorithms and dimensionality reduction methods for drug sensitivity prediction.

Affiliation

Center for Bioinformatics, Saarland Informatics Campus, Saarland University, 66123, Saarland, Germany.

Publication information

Brief Bioinform. 2024 May 23;25(4). doi: 10.1093/bib/bbae242.

DOI:10.1093/bib/bbae242

PMID:38797968

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11128483/

Abstract

A major challenge of precision oncology is the identification and prioritization of suitable treatment options based on molecular biomarkers of the considered tumor. In pursuit of this goal, large cancer cell line panels have successfully been studied to elucidate the relationship between cellular features and treatment response. Due to the high dimensionality of these datasets, machine learning (ML) is commonly used for their analysis. However, choosing a suitable algorithm and set of input features can be challenging. We performed a comprehensive benchmarking of ML methods and dimension reduction (DR) techniques for predicting drug response metrics. Using the Genomics of Drug Sensitivity in Cancer cell line panel, we trained random forests, neural networks, boosting trees and elastic nets for 179 anti-cancer compounds with feature sets derived from nine DR approaches. We compare the results regarding statistical performance, runtime and interpretability. Additionally, we provide strategies for assessing model performance compared with a simple baseline model and measuring the trade-off between models of different complexity. Lastly, we show that complex ML models benefit from using an optimized DR strategy, and that standard models, even when using considerably fewer features, can still be superior in performance.
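The idea of judging a model against a simple baseline after reducing the feature set can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and not the paper's method: the data are synthetic (not the Genomics of Drug Sensitivity in Cancer panel), the reduction step is a plain correlation filter that keeps a single feature, the model is a one-variable least-squares line standing in for the benchmarked algorithms, and the baseline predicts the mean training response. The function names (`benchmark`, `select_top_feature`, `fit_line`) are hypothetical.

```python
import random


def mean(values):
    """Arithmetic mean of a non-empty sequence."""
    return sum(values) / len(values)


def mse(y_true, y_pred):
    """Mean squared error between two equal-length sequences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either input is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5 if vx > 0 and vy > 0 else 0.0


def select_top_feature(X, y):
    """Filter-style reduction (assumed): index of the column most correlated with y."""
    n_features = len(X[0])
    scores = [abs(pearson([row[j] for row in X], y)) for j in range(n_features)]
    return max(range(n_features), key=scores.__getitem__)


def fit_line(x, y):
    """Ordinary least squares for y = slope * x + intercept."""
    mx, my = mean(x), mean(y)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum(
        (a - mx) ** 2 for a in x
    )
    return slope, my - slope * mx


def benchmark(seed=0, n_samples=200, n_features=50):
    """Compare a reduced-feature linear model with a mean-predicting baseline."""
    rng = random.Random(seed)
    # Synthetic "cell lines": only feature 3 carries signal, the rest is noise.
    X = [[rng.gauss(0, 1) for _ in range(n_features)] for _ in range(n_samples)]
    y = [2.0 * row[3] + rng.gauss(0, 0.5) for row in X]

    split = int(0.8 * n_samples)
    X_train, X_test = X[:split], X[split:]
    y_train, y_test = y[:split], y[split:]

    # Baseline: always predict the mean response seen during training.
    baseline_mse = mse(y_test, [mean(y_train)] * len(y_test))

    # Feature selection and model fitting use the training split only.
    j = select_top_feature(X_train, y_train)
    slope, intercept = fit_line([row[j] for row in X_train], y_train)
    model_mse = mse(y_test, [slope * row[j] + intercept for row in X_test])

    return {
        "selected_feature": j,
        "baseline_mse": baseline_mse,
        "model_mse": model_mse,
        # Fraction of the baseline error removed by the model.
        "relative_improvement": 1.0 - model_mse / baseline_mse,
    }


if __name__ == "__main__":
    for name, value in benchmark().items():
        print(name, value)
```

A relative improvement close to zero (or negative) would indicate that the model adds little over the baseline, which is the kind of check the abstract argues for. Selecting features on the training split only avoids leaking test information into the reduction step.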
