Branson Nikhil, Cutillas Pedro R, Bessant Conrad
School of Biological and Behavioural Sciences, Queen Mary University of London, London E1 4NS, United Kingdom.
Digital Environment Research Institute, Queen Mary University of London, London E1 1HH, United Kingdom.
Bioinform Adv. 2023 Dec 23;4(1):vbad190. doi: 10.1093/bioadv/vbad190. eCollection 2024.
Anti-cancer drug response prediction is a central problem within stratified medicine. Transcriptomic profiles of cancer cell lines are typically used for drug response prediction, but we hypothesize that proteomics or phosphoproteomics might be more suitable as they give a more direct insight into cellular processes. However, there has not yet been a systematic comparison between all three of these datatypes using consistent evaluation criteria.
Due to the limited number of cell lines with phosphoproteomics profiles, we use learning curves, plots of predictive performance as a function of dataset size, to compare current performance across the three omics datasets and to predict how each would perform with more data. We use neural networks and XGBoost and compare them against a simple rule-based benchmark. We show that phosphoproteomics slightly outperforms RNA-seq and proteomics using the 38 cell lines with profiles of all three omics data types. Furthermore, using the 877 cell lines with proteomics and RNA-seq profiles, we show that RNA-seq slightly outperforms proteomics. With the learning curves we predict that the mean squared error using the phosphoproteomics dataset would decrease if a dataset of the same size as the proteomics/transcriptomics datasets were collected. For the cell lines with proteomics and RNA-seq profiles, the learning curves reveal that neural networks outperform XGBoost for smaller dataset sizes and vice versa for larger datasets. Furthermore, the trajectory of the XGBoost curve suggests that it will improve faster than the neural networks as more data are collected.
See https://github.com/Nik-BB/Learning-curves-for-DRP for the code used.
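The learning-curve idea described above can be illustrated with a minimal sketch. This is not the authors' pipeline (see the repository above for that): it assumes synthetic data, swaps in a closed-form ridge regression for the neural networks and XGBoost, and assumes the curve follows a simple power law, MSE ≈ a·n^(-b), which ignores any irreducible error floor. All function names here are hypothetical. The only values taken from the abstract are the target size of 877 cell lines and the use of mean squared error.

```python
import numpy as np

def ridge_fit_predict(X_train, y_train, X_test, alpha=1.0):
    """Closed-form ridge regression (a stand-in model, not the paper's)."""
    n_features = X_train.shape[1]
    A = X_train.T @ X_train + alpha * np.eye(n_features)
    w = np.linalg.solve(A, X_train.T @ y_train)
    return X_test @ w

def learning_curve(X, y, sizes, n_repeats=20, n_test=200, seed=0):
    """Mean test MSE for each training-set size, averaged over random splits."""
    rng = np.random.default_rng(seed)
    mean_mse = []
    for size in sizes:
        mses = []
        for _ in range(n_repeats):
            perm = rng.permutation(len(y))
            # Disjoint test and training indices drawn from one permutation.
            test_idx = perm[:n_test]
            train_idx = perm[n_test:n_test + size]
            pred = ridge_fit_predict(X[train_idx], y[train_idx], X[test_idx])
            mses.append(np.mean((pred - y[test_idx]) ** 2))
        mean_mse.append(np.mean(mses))
    return np.array(mean_mse)

def fit_power_law(sizes, mse):
    """Fit mse ~ a * n**(-b) by least squares in log-log space."""
    slope, intercept = np.polyfit(np.log(sizes), np.log(mse), 1)
    return np.exp(intercept), -slope

# Synthetic stand-in for an omics feature matrix and drug-response values.
rng = np.random.default_rng(42)
n_samples, n_features = 1100, 20
X = rng.normal(size=(n_samples, n_features))
true_w = rng.normal(size=n_features)
y = X @ true_w + rng.normal(scale=1.0, size=n_samples)

sizes = np.array([40, 80, 160, 320, 640])
curve = learning_curve(X, y, sizes)
a, b = fit_power_law(sizes, curve)

# Extrapolate the fitted curve to a larger dataset (877, as in the abstract).
projected = a * 877.0 ** (-b)
print("observed MSE per size:", np.round(curve, 3))
print("power-law exponent b:", round(b, 3))
print("projected MSE at n=877:", round(projected, 3))
```

Comparing two data types or two models then amounts to computing one such curve per combination on the same splits and comparing both the current values and the fitted exponents. Because the pure power law has no noise floor, extrapolations far beyond the observed sizes are likely to be optimistic.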