Machine learning approach to predict blood-secretory proteins and potential biomarkers for liver cancer using omics data.

Affiliation

Department of Bioinformatics, Pondicherry University, Puducherry 605014, India.

Publication information

J Proteomics. 2024 Oct 30;309:105298. doi: 10.1016/j.jprot.2024.105298. Epub 2024 Aug 30.

Abstract

Identifying non-invasive blood-based biomarkers is crucial for early detection and monitoring of liver cancer (LC), and thereby for improving patient outcomes. This study used computational approaches to predict potential blood-based biomarkers for LC. Machine learning (ML) models were developed using selected features of blood-secretory proteins collected from curated databases. The logistic regression (LR) model demonstrated the best performance. Transcriptome analysis across 7 LC cohorts revealed 231 common differentially expressed genes (DEGs). The proteins encoded by these DEGs were compared with the ML dataset, revealing 29 proteins that overlapped with the blood-secretory dataset. Among the remaining protein-coding genes, the LR model predicted 29 additional proteins as blood-secretory. As a result, 58 potential blood-secretory proteins were obtained. Among the top 20 genes, 13 common hub genes were identified. Further, area under the receiver operating characteristic curve (ROC AUC) analysis was performed to assess the genes as potential diagnostic blood biomarkers. Six genes, ESM1, FCN2, MDK, GPC3, CTHRC1 and COL6A6, exhibited an AUC value higher than 0.85 and were predicted as blood-secretory. This study highlights the potential of an integrative computational approach for discovering non-invasive blood-based biomarkers in LC, facilitating further validation and clinical translation.

Significance

Liver cancer is one of the leading causes of premature death worldwide, with its prevalence and mortality rates projected to increase. Although current diagnostic methods are highly sensitive, they are invasive and unsuitable for repeated testing. Blood biomarkers offer a promising non-invasive alternative, but the wide dynamic range of blood protein concentrations poses experimental challenges. Therefore, utilizing available omics data to develop a diagnostic model could provide a potential solution for accurate diagnosis. This study developed a computational method integrating machine learning and bioinformatics analysis to identify potential blood biomarkers. As a result, the biomarkers ESM1, FCN2, MDK, GPC3, CTHRC1 and COL6A6 were identified, holding significant promise for improving the diagnosis and understanding of liver cancer. The integrated method can be applied to other cancers, offering a possible solution for early detection and improved patient outcomes.
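
The final screening step described in the abstract (keep genes whose ROC AUC exceeds 0.85 and that are predicted as blood-secretory) can be illustrated with a minimal Python sketch. This is not the authors' code. The gene names `GENE_A`, `GENE_B`, `GENE_C`, the expression values, and the function names are hypothetical, and the direction-agnostic AUC (so that down-regulated genes can also score highly) is an assumption made here for illustration. The AUC is computed from the Mann-Whitney rank statistic, which is equivalent to the area under the ROC curve.

```python
def roc_auc(labels, scores):
    """ROC AUC via the Mann-Whitney U statistic (tied scores share an average rank)."""
    pairs = sorted(zip(scores, labels))
    n = len(pairs)
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Extend j to the end of the current group of tied scores.
        while j + 1 < n and pairs[j + 1][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # average 1-based rank of the tie group
        for k in range(i, j + 1):
            ranks[k] = avg_rank
        i = j + 1
    n_pos = sum(1 for _, y in pairs if y == 1)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both tumor (1) and normal (0) samples are required")
    rank_sum_pos = sum(r for r, (_, y) in zip(ranks, pairs) if y == 1)
    u = rank_sum_pos - n_pos * (n_pos + 1) / 2
    return u / (n_pos * n_neg)


def screen_candidates(expression, labels, secretory, threshold=0.85):
    """Keep genes with AUC above the threshold that are also flagged blood-secretory."""
    selected = {}
    for gene, values in expression.items():
        auc = roc_auc(labels, values)
        # Assumption: treat up- and down-regulated genes symmetrically.
        auc = max(auc, 1.0 - auc)
        if auc > threshold and gene in secretory:
            selected[gene] = auc
    return selected


# Toy data: 1 = tumor sample, 0 = normal sample (all values are made up).
labels = [1, 1, 1, 0, 0, 0]
expression = {
    "GENE_A": [9, 8, 7, 3, 2, 1],  # cleanly separates the two groups
    "GENE_B": [5, 1, 4, 2, 6, 3],  # poor separation
    "GENE_C": [1, 2, 3, 7, 8, 9],  # separates well, but not flagged secretory
}
secretory = {"GENE_A", "GENE_B"}  # stand-in for the model's blood-secretory calls

selected = screen_candidates(expression, labels, secretory)
print(selected)  # {'GENE_A': 1.0}
```

In this toy example only `GENE_A` passes both filters: `GENE_B` falls below the AUC threshold and `GENE_C` is excluded because it is not in the hypothetical blood-secretory set.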
