Cygert Sebastian, Pastuszak Krzysztof, Górski Franciszek, Sieczczyński Michał, Juszczyk Piotr, Rutkowski Antoni, Lewalski Sebastian, Różański Robert, Jopek Maksym Albin, Jassem Jacek, Czyżewski Andrzej, Wurdinger Thomas, Best Myron G, Żaczek Anna J, Supernat Anna
Department of Multimedia Systems, Faculty of Electronics, Telecommunication and Informatics, Gdansk University of Technology, 80-233 Gdańsk, Poland.
Ideas NCBR, 00-801 Warsaw, Poland.
Cancers (Basel). 2023 Apr 17;15(8):2336. doi: 10.3390/cancers15082336.
Liquid biopsies offer minimally invasive diagnosis and monitoring of cancer. This biosource is often analyzed by sequencing, which generates highly complex data that can be processed with machine learning tools. Nevertheless, validating the clinical applications of such methods is challenging. It requires: (a) using data from many patients; (b) verifying potential bias concerning sample collection; and (c) adding interpretability to the model. In this work, we used RNA sequencing data of tumor-educated platelets (TEPs) and performed a binary classification (cancer vs. no-cancer). First, we compiled a large-scale dataset with more than a thousand donors. Next, we used different convolutional neural networks (CNNs) and boosting methods to evaluate classifier performance, obtaining an area under the curve (AUC) of 0.96. We then identified different clusters of splice variants using expert knowledge from the Kyoto Encyclopedia of Genes and Genomes (KEGG). Employing boosting algorithms, we identified the features with the highest predictive power. Finally, we tested the robustness of the models using test data from previously unseen hospitals. Notably, we did not observe any decrease in model performance. Our work demonstrates the great potential of using TEP data for cancer patient classification and opens an avenue toward more in-depth cancer diagnostics.
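The abstract reports performance as an area under the curve and checks robustness on data from hospitals not used elsewhere. The sketch below illustrates those two ideas only and is not the authors' pipeline. It assumes the curve is the ROC curve, which the abstract does not state, and it uses leave-one-site-out splitting as one possible way to hold out a hospital. The function names `roc_auc` and `leave_one_site_out` and the idea of a per-sample site label are hypothetical.

```python
from collections import defaultdict
from typing import Hashable, Iterator, List, Sequence, Tuple


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """ROC AUC via the pairwise (Mann-Whitney) identity.

    The value is the fraction of (positive, negative) pairs in which the
    positive sample receives the higher score; ties count as one half.
    Labels are assumed to be 1 (cancer) or 0 (no cancer).
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def leave_one_site_out(
    sites: Sequence[Hashable],
) -> Iterator[Tuple[Hashable, List[int], List[int]]]:
    """Yield (held_out_site, train_indices, test_indices) for every site.

    Every sample from the held-out site goes to the test set, so a model
    fitted on the training indices has never seen data from that site.
    """
    by_site = defaultdict(list)
    for i, site in enumerate(sites):
        by_site[site].append(i)
    for held_out, test_idx in by_site.items():
        train_idx = [i for s, idx in by_site.items() if s != held_out for i in idx]
        yield held_out, train_idx, test_idx
```

In practice, a model would be fitted on `train_idx` and scored with `roc_auc` on `test_idx` for each held-out site. The pairwise loop is quadratic in the number of samples and is meant for illustration; scikit-learn's `roc_auc_score` and `LeaveOneGroupOut` cover the same ground for real datasets.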