Laboratory of Translational Oncology, Intercollegiate Faculty of Biotechnology of the University of Gdańsk and the Medical University of Gdańsk, Poland.
Centre of Biostatistics and Bioinformatics, Medical University of Gdańsk, Poland.
Mol Oncol. 2024 Nov;18(11):2743-2754. doi: 10.1002/1878-0261.13689. Epub 2024 Jun 17.
Liquid biopsy shows excellent potential in patient management, offering a minimally invasive and cost-effective approach to detecting and monitoring cancer, even at early stages. Because liquid biopsy data are complex, machine-learning techniques are gaining attention for sample analysis, especially for multidimensional data such as RNA expression profiles. Yet the community has not agreed on which methods are most effective or on how the data should be processed. To address this, we performed a large-scale study using various machine-learning techniques. First, we re-examined existing datasets and filtered out some patients to ensure the quality of the data collection. The final data collection included platelet RNA samples from 1397 cancer patients (17 types of cancer) and 354 asymptomatic, presumed healthy donors. We then assessed an array of machine-learning models and techniques (e.g., feature selection of RNA transcripts) in pan-cancer detection and multiclass classification. Our results show that simple logistic regression performs best, reaching a 68% cancer detection rate at a 99% specificity level and a multiclass classification accuracy of 79.38% when distinguishing between five cancer types. In summary, by revisiting classical machine-learning models, we exceeded the previously used method by 5% and 9.65% in cancer detection and multiclass classification, respectively. To ease further research, we open-source our code and data processing pipelines (https://gitlab.com/jopekmaksym/improving-platelet-rna-based-diagnostics), which we hope will serve the community as a strong baseline.
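The headline metric, the cancer detection rate at a fixed specificity level, can be made concrete with a minimal sketch. It is not taken from the authors' repository: the function name `sensitivity_at_specificity`, the thresholding rule, and the toy scores are all assumptions made for illustration. It assumes a classifier (for example, a logistic regression) has already produced one score per sample, with higher scores meaning "more likely cancer".

```python
import math


def sensitivity_at_specificity(control_scores, case_scores, specificity=0.99):
    """Return (threshold, sensitivity) at a required specificity.

    Hypothetical helper, not the study's code. A sample is called positive
    when its score is strictly greater than the threshold. The threshold is
    the lowest control score that still leaves at least `specificity` of the
    controls classified as negative.
    """
    if not control_scores or not case_scores:
        raise ValueError("both score lists must be non-empty")
    if not 0.0 <= specificity <= 1.0:
        raise ValueError("specificity must be between 0 and 1")

    ranked = sorted(control_scores)
    n_controls = len(ranked)
    # Minimum number of controls that must fall at or below the threshold.
    # The small epsilon guards against floating-point noise in the product.
    k = math.ceil(specificity * n_controls - 1e-9)
    threshold = ranked[k - 1] if k > 0 else -math.inf

    # Sensitivity (detection rate): fraction of cases scoring above threshold.
    detected = sum(score > threshold for score in case_scores)
    return threshold, detected / len(case_scores)


# Toy scores, invented for illustration only (not data from the study).
controls = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
cases = [0.5, 0.95, 0.97, 1.2]
threshold, sensitivity = sensitivity_at_specificity(controls, cases, specificity=0.9)
print(threshold, sensitivity)  # 0.9 0.75
```

In practice the threshold would be chosen on held-out control samples and then applied unchanged to the test set, so that the reported specificity is not tuned on the data being evaluated.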