Division of Data Science and Learning, Argonne National Laboratory, Lemont, IL, USA.
University of Chicago Consortium for Advanced Science and Engineering, University of Chicago, Chicago, IL, USA.
BMC Bioinformatics. 2021 May 17;22(1):252. doi: 10.1186/s12859-021-04163-y.
Motivated by the size and availability of cell line drug sensitivity data, researchers have been developing machine learning (ML) models for predicting drug response to advance cancer treatment. As drug sensitivity studies continue generating drug response data, a common question is whether the generalization performance of existing prediction models can be further improved with more training data.
We use empirical learning curves to evaluate and compare the data scaling properties of two neural network (NN) and two gradient boosting decision tree (GBDT) models trained on four cell line drug screening datasets. The learning curves are accurately fitted by a power law, providing a framework for assessing the data scaling behavior of these models.
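As a minimal sketch of this idea (an illustrative assumption, not the authors' code or their exact functional form), a power-law learning curve of the form error(m) = a·m^b can be fitted to scores measured at increasing training-set sizes by ordinary least squares in log-log space, and then extrapolated to larger sample sizes:

```python
import numpy as np

# Hypothetical measurements: prediction error on a fixed test set
# at increasing training-set sizes m (synthetic values for illustration).
sizes = np.array([1e3, 2e3, 5e3, 1e4, 2e4, 5e4])
errors = np.array([0.30, 0.25, 0.20, 0.17, 0.14, 0.11])

# Power law error(m) = a * m**b becomes linear after a log transform:
# log(error) = log(a) + b * log(m), so fit a line by least squares.
b, log_a = np.polyfit(np.log(sizes), np.log(errors), 1)
a = np.exp(log_a)

# Extrapolate the fitted curve to a larger, not-yet-collected sample size.
predicted_err = a * 1e5 ** b
```

A negative fitted exponent `b` indicates that error still decreases with more data, and the extrapolated value gives a forward-looking estimate of the payoff from collecting additional samples.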
The curves show that no single model dominates prediction performance across all datasets and training sizes, suggesting that the shape of a learning curve depends on the specific pairing of ML model and dataset. The multi-input NN (mNN), in which gene expressions of cancer cells and molecular drug descriptors feed separate subnetworks, outperforms a single-input NN (sNN), in which the cell and drug features are concatenated into a single input layer. In contrast, a GBDT with hyperparameter tuning outperforms both NNs at the lower range of training set sizes on two of the tested datasets, whereas the mNN consistently performs better at the higher range of training sizes. Moreover, the trajectory of the curves suggests that increasing the sample size is expected to further improve the prediction scores of both NNs. These observations demonstrate the benefit of using learning curves to evaluate prediction models, offering a broader perspective on their overall data scaling characteristics.
A fitted power law learning curve provides a forward-looking metric for analyzing prediction performance and can serve as a co-design tool to guide experimental biologists and computational scientists in designing experiments for prospective research studies.