Feature-based classifiers for somatic mutation detection in tumour-normal paired sequencing data.

Affiliation

Department of Molecular Oncology, BC Cancer Agency, Vancouver, BC, Canada.

Publication details

Bioinformatics. 2012 Jan 15;28(2):167-75. doi: 10.1093/bioinformatics/btr629. Epub 2011 Nov 13.

Abstract

MOTIVATION

The study of cancer genomes now routinely involves using next-generation sequencing technology (NGS) to profile tumours for single nucleotide variant (SNV) somatic mutations. However, surprisingly few published bioinformatics methods exist for the specific purpose of identifying somatic mutations from NGS data and existing tools are often inaccurate, yielding intolerably high false prediction rates. As such, the computational problem of accurately inferring somatic mutations from paired tumour/normal NGS data remains an unsolved challenge.

RESULTS

We present the comparison of four standard supervised machine learning algorithms for the purpose of somatic SNV prediction in tumour/normal NGS experiments. To evaluate these approaches (random forest, Bayesian additive regression tree, support vector machine and logistic regression), we constructed 106 features representing 3369 candidate somatic SNVs from 48 breast cancer genomes, originally predicted with naive methods and subsequently revalidated to establish ground truth labels. We trained the classifiers on this data (consisting of 1015 true somatic mutations and 2354 non-somatic mutation positions) and conducted a rigorous evaluation of these methods using a cross-validation framework and hold-out test NGS data from both exome capture and whole genome shotgun platforms. All learning algorithms employing predictive discriminative approaches with feature selection improved the predictive accuracy over standard approaches by statistically significant margins. In addition, using unsupervised clustering of the ground truth 'false positive' predictions, we noted several distinct classes and present evidence suggesting non-overlapping sources of technical artefacts illuminating important directions for future study.
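
The cross-validation framework mentioned above can be illustrated with a minimal, standard-library-only sketch. It is not the MutationSeq implementation: the fold count (k=5), the helper names (`stratified_folds`, `cross_validate`) and the majority-class placeholder classifier are assumptions made for illustration, and no real features are used. Only the class counts (1015 somatic, 2354 non-somatic) come from the abstract. The sketch shows the accuracy a trivial baseline reaches on such imbalanced labels, which any trained classifier would need to beat.

```python
import random
from collections import Counter

N_SOMATIC = 1015      # true somatic mutations (from the abstract)
N_NON_SOMATIC = 2354  # non-somatic positions (from the abstract)


def stratified_folds(labels, k, seed=0):
    """Split indices into k folds, keeping class proportions roughly equal."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class's indices round-robin across the folds.
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds


def cross_validate(labels, k, fit, predict):
    """Return per-fold accuracy for a classifier given as fit/predict callables."""
    folds = stratified_folds(labels, k)
    accuracies = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in range(len(labels)) if i not in held]
        model = fit(train, labels)
        correct = sum(predict(model, i) == labels[i] for i in held_out)
        accuracies.append(correct / len(held_out))
    return accuracies


# Hypothetical placeholder: always predict the majority class of the training folds.
def fit_majority(train_idx, labels):
    return Counter(labels[i] for i in train_idx).most_common(1)[0][0]


def predict_majority(model, idx):
    return model


# 1 = somatic, 0 = non-somatic; k=5 is an assumed fold count.
labels = [1] * N_SOMATIC + [0] * N_NON_SOMATIC
fold_accuracies = cross_validate(labels, k=5, fit=fit_majority, predict=predict_majority)
mean_accuracy = sum(fold_accuracies) / len(fold_accuracies)
```

Swapping `fit_majority` and `predict_majority` for a real learner (for example a random forest over the 106 features) would leave the evaluation loop unchanged, which is the point of passing the classifier in as a pair of callables.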

AVAILABILITY

Software called MutationSeq and datasets are available from http://compbio.bccrc.ca.
