Comprehensive study of semi-supervised learning for DNA methylation-based supervised classification of central nervous system tumors.

Affiliation

Department of Pathology, St. Jude Children's Research Hospital, 262 Danny Thomas Place, MS 250, Memphis, TN, 38105-3678, USA.

Publication information

BMC Bioinformatics. 2022 Jun 8;23(1):223. doi: 10.1186/s12859-022-04764-1.

Abstract

BACKGROUND

Precision medicine for cancer treatment relies on an accurate pathological diagnosis. The number of known tumor classes has increased rapidly, and relying on traditional histopathologic classification alone has become infeasible. To help reduce variability and validation costs and to standardize the histopathological diagnostic process, supervised machine learning models using DNA methylation data have been developed for tumor classification. These methods require large labeled training data sets to reach clinically acceptable classification accuracy. While unlabeled epigenetic data are abundant across multiple databases, labeling pathology data for machine learning models is time-consuming and resource-intensive, especially for rare tumor types. Semi-supervised learning (SSL) approaches have been used to maximize the utility of labeled and unlabeled data for classification tasks and have been applied effectively in genomics. However, SSL methods have not yet been explored with epigenetic data, nor have they been shown to benefit central nervous system (CNS) tumor classification.

RESULTS

This paper explores the application of semi-supervised machine learning to methylation data to improve the accuracy of supervised learning models in classifying CNS tumors. We comprehensively evaluated 11 SSL methods and developed a novel combination approach that pairs a self-training with editing model using a support vector machine (SETRED-SVM) with an L2-penalized multinomial logistic regression model to obtain high-confidence labels from a few labeled instances. Results across eight random forest and neural network models show that the pseudo-labels derived from our SSL method can significantly increase prediction accuracy for 82 CNS tumors and 9 normal controls.
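The two-stage idea described above (propose pseudo-labels with an edited self-training step, then keep only those a penalized logistic regression scores as high confidence) can be sketched in a few lines of plain Python. This is a minimal illustration under loud assumptions, not the authors' implementation: a nearest-centroid rule stands in for the SVM base learner, a k-nearest-neighbor majority vote stands in for SETRED's graph-based editing test, only a single self-training round is run, and the 0.8 confidence threshold, the toy two-feature data, and all function names are hypothetical.

```python
import math

def dist2(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b))

def class_centroids(X, y):
    """Mean feature vector per class: a toy stand-in for the SVM base learner."""
    groups = {}
    for xi, yi in zip(X, y):
        groups.setdefault(yi, []).append(xi)
    return {c: [sum(col) / len(rows) for col in zip(*rows)]
            for c, rows in groups.items()}

def propose_labels(X_lab, y_lab, X_unl, k=3):
    """Propose pseudo-labels, then drop ("edit") any proposal that a majority
    of the k nearest labeled neighbors does not share (simplified editing)."""
    cents = class_centroids(X_lab, y_lab)
    kept = {}
    for i, x in enumerate(X_unl):
        label = min(cents, key=lambda c: dist2(x, cents[c]))
        nearest = sorted(range(len(X_lab)), key=lambda j: dist2(x, X_lab[j]))[:k]
        votes = sum(1 for j in nearest if y_lab[j] == label)
        if 2 * votes > k:
            kept[i] = label
    return kept

def predict_proba(W, x):
    """Softmax class probabilities; the last entry of each weight row is the bias."""
    z = [sum(w * v for w, v in zip(row[:-1], x)) + row[-1] for row in W]
    m = max(z)
    e = [math.exp(v - m) for v in z]
    total = sum(e)
    return [v / total for v in e]

def fit_softmax(X, y, n_classes, l2=0.01, lr=0.5, epochs=500):
    """Multinomial logistic regression fit by batch gradient descent,
    with an L2 penalty on the weights (not on the bias)."""
    d, n = len(X[0]), len(X)
    W = [[0.0] * (d + 1) for _ in range(n_classes)]
    for _ in range(epochs):
        grad = [[0.0] * (d + 1) for _ in range(n_classes)]
        for xi, yi in zip(X, y):
            p = predict_proba(W, xi)
            for c in range(n_classes):
                err = p[c] - (1.0 if c == yi else 0.0)
                for j in range(d):
                    grad[c][j] += err * xi[j]
                grad[c][d] += err
        for c in range(n_classes):
            for j in range(d):
                W[c][j] -= lr * (grad[c][j] / n + l2 * W[c][j])
            W[c][d] -= lr * grad[c][d] / n
    return W

def high_confidence_pseudo_labels(X_lab, y_lab, X_unl, n_classes, threshold=0.8):
    """Keep a proposed pseudo-label only if the logistic model assigns it
    a probability of at least `threshold` (assumed value)."""
    proposed = propose_labels(X_lab, y_lab, X_unl)
    W = fit_softmax(X_lab, y_lab, n_classes)
    return {i: label for i, label in proposed.items()
            if predict_proba(W, X_unl[i])[label] >= threshold}

# Toy data: two made-up features per sample, two classes, six labeled samples.
X_lab = [[0.10, 0.20], [0.20, 0.10], [0.15, 0.15],
         [0.80, 0.90], [0.90, 0.80], [0.85, 0.85]]
y_lab = [0, 0, 0, 1, 1, 1]
# The third unlabeled sample sits midway between the classes.
X_unl = [[0.12, 0.18], [0.88, 0.82], [0.50, 0.50], [0.20, 0.20], [0.80, 0.80]]

selected = high_confidence_pseudo_labels(X_lab, y_lab, X_unl, n_classes=2)
print(selected)  # maps unlabeled-sample index to accepted pseudo-label
```

In this sketch the ambiguous midpoint sample is left unlabeled, while the samples close to a class receive pseudo-labels. In a pipeline like the one the abstract describes, the accepted samples would then be appended to the labeled set used to train the downstream random forest or neural network classifiers.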

CONCLUSIONS

The proposed combination of a semi-supervised technique and multinomial logistic regression has the potential to make effective use of the abundant, publicly available unlabeled methylation data. Such an approach is highly beneficial in providing additional training examples, especially for rare tumor types, to boost the prediction accuracy of supervised models.


Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/53a0/9178802/aa69851cf763/12859_2022_4764_Fig1_HTML.jpg
