LogLoss-BERAF: An ensemble-based machine learning model for constructing highly accurate diagnostic sets of methylation sites accounting for heterogeneity in prostate cancer.

Affiliations

Federal Research and Clinical Center of Physical-Chemical Medicine of Federal Medical Biological Agency, Moscow, Russian Federation.

Moscow Institute of Physics and Technology (State University), Dolgoprudny, Moscow Region, Russian Federation.

Publication information

PLoS One. 2018 Nov 2;13(11):e0204371. doi: 10.1371/journal.pone.0204371. eCollection 2018.

Abstract

Although modern methods of whole-genome DNA methylation analysis have a wide range of applications, they are not suitable for clinical diagnostics because of their high cost and complexity and the large amount of sample DNA they require. It is therefore crucial to identify a relatively small number of methylation sites that provide high precision and sensitivity for the diagnosis of pathological states. We propose an algorithm for constructing limited subsamples from high-dimensional data to form diagnostic panels. We have developed a tool that applies several selection methods, scored by the cross-entropy loss metric (LogLoss), to find an optimal, minimal combination of factors and thereby identify a subset of methylation sites. We show that the algorithm works effectively with different genome methylation patterns using ensemble-based machine learning methods. Algorithm efficiency, precision and robustness were evaluated on five genome-wide DNA methylation datasets (626 samples in total), with each dataset classified into tumor and non-tumor samples. The algorithm produced an AUC of 0.97 (95% CI: 0.94-0.99, 9 sites) for prostate adenocarcinoma and an AUC of 1.0 (2 to 6 sites) for urothelial bladder carcinoma, two types of kidney carcinoma and colorectal carcinoma. For prostate adenocarcinoma, we showed that the identified differential-variability methylation patterns distinguish a cluster of samples with a higher recurrence rate (hazard ratio for recurrence = 0.48, 95% CI: 0.05-0.92; log-rank test, p-value < 0.03). We also identified several clusters of correlated, interchangeable methylation sites that can be used to develop a biological interpretation of the resulting models and to further select the sites most suitable for designing diagnostic panels.
LogLoss-BERAF is implemented as standalone Python code. The open-source code is freely available from https://github.com/bioinformatics-IBCH/logloss-beraf, along with the models described in this article.
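
The core idea in the abstract, choosing a minimal set of sites by scoring candidate subsets with LogLoss, can be illustrated with a minimal sketch. This is not the LogLoss-BERAF implementation: the greedy forward search, the toy centroid-based classifier (a stand-in for the ensemble learner), the leave-one-out scoring, the `min_gain` stopping threshold, and the small beta-value matrix are all assumptions made for illustration.

```python
import math

def log_loss(y_true, p_pred, eps=1e-15):
    """Mean binary cross-entropy; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

def centroid_proba(X, y, sites, row):
    """Toy stand-in classifier: P(tumor) from squared distances to class centroids."""
    def centroid(label):
        rows = [x for x, t in zip(X, y) if t == label]
        return [sum(r[s] for r in rows) / len(rows) for s in sites]
    c0, c1 = centroid(0), centroid(1)
    d0 = sum((row[s] - c) ** 2 for s, c in zip(sites, c0))
    d1 = sum((row[s] - c) ** 2 for s, c in zip(sites, c1))
    # Two-class softmax over scaled negative distances; the exponent is capped
    # so math.exp cannot overflow.
    return 1.0 / (1.0 + math.exp(min(50.0, 20.0 * (d1 - d0))))

def loo_log_loss(X, y, sites):
    """Leave-one-out LogLoss of the toy classifier restricted to `sites`."""
    preds = []
    for i in range(len(X)):
        X_train, y_train = X[:i] + X[i + 1:], y[:i] + y[i + 1:]
        preds.append(centroid_proba(X_train, y_train, sites, X[i]))
    return log_loss(y, preds)

def greedy_select(X, y, max_sites=3, min_gain=1e-3):
    """Add the site that lowers LogLoss most; stop when the gain is negligible."""
    selected, best = [], float("inf")
    remaining = list(range(len(X[0])))
    while remaining and len(selected) < max_sites:
        score, site = min((loo_log_loss(X, y, selected + [s]), s) for s in remaining)
        if best - score < min_gain:
            break
        selected.append(site)
        remaining.remove(site)
        best = score
    return selected, best

# Hypothetical beta values (rows: samples, columns: methylation sites).
# Only column 2 differs between non-tumor (0) and tumor (1) samples.
X = [
    [0.52, 0.40, 0.10, 0.70], [0.48, 0.61, 0.15, 0.30],
    [0.55, 0.35, 0.12, 0.55], [0.45, 0.58, 0.08, 0.42],
    [0.50, 0.42, 0.85, 0.65], [0.53, 0.60, 0.90, 0.35],
    [0.47, 0.38, 0.80, 0.50], [0.51, 0.55, 0.88, 0.48],
]
y = [0, 0, 0, 0, 1, 1, 1, 1]

selected, loss = greedy_select(X, y)
print("selected sites:", selected, "LogLoss:", loss)
```

Unlike accuracy, LogLoss penalizes confident wrong probabilities, which makes it usable as a stopping signal: once adding a site no longer lowers the loss by a meaningful amount, the panel is kept at its current size. In practice the toy classifier would be replaced by an ensemble model and the leave-one-out loop by a resampling scheme appropriate to the dataset.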
