Identifying novel transcript biomarkers for hepatocellular carcinoma (HCC) using RNA-Seq datasets and machine learning.

Affiliation

Department of Toxicogenomics, School of Oncology and Developmental Biology (GROW), Maastricht University, Maastricht, The Netherlands.

Publication information

BMC Cancer. 2021 Aug 27;21(1):962. doi: 10.1186/s12885-021-08704-9.

Abstract

BACKGROUND

Hepatocellular carcinoma (HCC) is one of the leading causes of cancer death worldwide, owing to limitations in its prognosis. Current prognostic approaches include radiological examination and detection of serum biomarkers; however, both have limited efficiency and are ineffective for early prognosis. Given these limitations, we propose using RNA-Seq data to evaluate putative transcript-level biomarkers with higher accuracy that could aid early prognosis.

METHODS

To identify such potential transcript biomarkers, RNA-Seq data for healthy liver and various HCC cell models were subjected to five different machine learning algorithms: random forest, K-nearest neighbor, Naïve Bayes, support vector machine, and neural networks. Several metrics were evaluated: sensitivity, specificity, Matthews correlation coefficient (MCC), informedness, and AUC-ROC (the last for all algorithms except support vector machine). The algorithms that produced the highest values across all metrics were chosen to extract the top features, which were then subjected to recursive feature elimination. Recursive feature elimination yielded the smallest set of features able to differentiate between the healthy and HCC cell models.

RESULTS

The metrics show that the known protein biomarkers for HCC perform worse than the complete transcriptomics data. Among the machine learning algorithms, random forest and support vector machine performed best. Applying recursive feature elimination to the top features of random forest and support vector machine selected three transcripts that achieved an accuracy of 0.97 and a kappa of 0.93. Of the three transcripts, two were protein coding (PARP2-202 and SPON2-203) and one was non-coding (CYREN-211). Lastly, we demonstrated that these three transcripts outperformed sets of three randomly chosen transcripts (15,000 combinations); they are therefore not chance findings and could be interesting candidates for new HCC biomarker development.
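As a rough sketch of the two quantities reported above, the code below computes accuracy and Cohen's kappa from predicted labels, then compares a selected feature set against randomly drawn triples. The abstract does not say how the random comparison was scored, so the empirical p-value, the function names, and the toy `score_fn` are all assumptions made for illustration.

```python
import random

def accuracy_and_kappa(y_true, y_pred):
    """Return (accuracy, Cohen's kappa) for two equal-length label lists."""
    n = len(y_true)
    observed = sum(t == p for t, p in zip(y_true, y_pred)) / n
    labels = set(y_true) | set(y_pred)
    # Agreement expected by chance, from the marginal label frequencies.
    expected = sum((y_true.count(l) / n) * (y_pred.count(l) / n) for l in labels)
    # Kappa is undefined when expected agreement is 1; 0.0 is returned by convention here.
    kappa = (observed - expected) / (1 - expected) if expected < 1 else 0.0
    return observed, kappa

def empirical_p_value(selected_score, n_features, score_fn, n_draws=15000, k=3, seed=0):
    """Fraction of random k-feature sets scoring at least as well as the selected set."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_draws):
        combo = rng.sample(range(n_features), k)
        if score_fn(combo) >= selected_score:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_draws + 1)

# Toy labels: one HCC sample (1) misclassified as healthy (0).
acc, kappa = accuracy_and_kappa([1, 1, 1, 1, 0, 0, 0, 0], [1, 1, 1, 0, 0, 0, 0, 0])

# Toy scoring: features 0-2 are informative (0.97), the rest are not (0.60).
# In practice score_fn would train and evaluate a classifier on the chosen features.
quality = [0.97] * 3 + [0.60] * 997
p = empirical_p_value(
    selected_score=0.97,
    n_features=len(quality),
    score_fn=lambda combo: sum(quality[i] for i in combo) / len(combo),
)
print(acc, kappa, p)
```

A small p-value in this setup means that few random triples match the selected set, which is the sense in which a result would not be a chance finding.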

CONCLUSION

Combining RNA-Seq data with machine learning approaches can aid in finding novel transcript biomarkers. The three biomarkers identified (PARP2-202, SPON2-203, and CYREN-211) showed the highest accuracy of all transcripts in differentiating the healthy and HCC cell models. The machine learning pipeline developed in this study can be applied to any RNA-Seq dataset to find novel transcript biomarkers. Code: www.github.com/rajinder4489/ML_biomarkers.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/13d2/8394105/fe3a0fda195f/12885_2021_8704_Fig1_HTML.jpg
