Improving the filtering of false positive single nucleotide variations by combining genomic features with quality metrics.

Affiliations

Department of Computer Engineering, Kocaeli University, Kocaeli 41000, Turkey.

R&D Department, Idea Technology Solutions LLC., Istanbul 34396, Turkey.

Publication information

Bioinformatics. 2023 Dec 1;39(12). doi: 10.1093/bioinformatics/btad694.

DOI: 10.1093/bioinformatics/btad694

PMID: 38019945

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC10692869/

Abstract

MOTIVATION

Technical errors in sequencing or bioinformatics steps, together with alignment difficulties at some genomic sites, result in false positive (FP) variants. Filtering on quality metrics is a common way to detect FP variants, but fixed thresholds overlook the more complex relationships between features, so thresholds set to reduce the FP rate may also discard true positive variants. The goal of this study is to develop a machine learning-based model for identifying FPs that integrates quality metrics with genomic features and that offers feature interpretability to provide insight into model results.
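The limitation of threshold filtering described above can be illustrated with a minimal sketch. The metric names (`qd`, `fs`, `mq`), the cutoff values, and the four example calls are all assumptions made up for illustration; none of them come from the paper.

```python
from dataclasses import dataclass


@dataclass
class Call:
    name: str
    qd: float      # quality normalized by depth (hypothetical metric)
    fs: float      # strand-bias score (hypothetical metric)
    mq: float      # mapping quality (hypothetical metric)
    is_true: bool  # ground-truth label, known only for benchmark data


# Illustrative cutoffs, not taken from the paper.
MIN_QD, MAX_FS, MIN_MQ = 2.0, 60.0, 40.0


def passes_hard_filter(call: Call) -> bool:
    """Each metric is tested on its own; interactions between metrics are ignored."""
    return call.qd >= MIN_QD and call.fs <= MAX_FS and call.mq >= MIN_MQ


calls = [
    Call("a", qd=20.0, fs=1.0, mq=60.0, is_true=True),    # clean true variant
    Call("b", qd=1.5, fs=2.0, mq=60.0, is_true=True),     # true variant with low QD
    Call("c", qd=1.0, fs=80.0, mq=30.0, is_true=False),   # obvious false positive
    Call("d", qd=5.0, fs=40.0, mq=45.0, is_true=False),   # borderline on every metric
]

kept = [c.name for c in calls if passes_hard_filter(c)]
lost_true = [c.name for c in calls if c.is_true and not passes_hard_filter(c)]
print("kept:", kept)
print("true variants lost:", lost_true)
```

In this toy data, call `b` is a true variant removed by a single cutoff, while call `d` is a false positive that passes because no individual metric crosses its threshold. A model that learns from the metrics jointly can, in principle, separate such cases.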

RESULTS

We propose a random forest-based model that uses genomic features to improve the identification of FPs. Further examination of the features shows that the newly introduced features have an important impact on the prediction of variants misclassified by VEF, GATK-CNN, and GARFIELD, three recently introduced FP detection systems. We applied cost-sensitive training to guard against misclassifying true variants, yielding a model that is robust against such errors while increasing the prediction rate of FP variants. The model can be easily re-trained when factors such as experimental protocols may alter the FP distribution. In addition, it has an interpretability mechanism that allows users to understand the impact of features on the model's predictions.
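A rough sketch of the general approach, assuming scikit-learn, is shown below. It is not the FPDetect implementation: the data are synthetic, the three feature names are hypothetical, the class weights are arbitrary, and impurity-based feature importances stand in for the interpretability mechanism, which the abstract does not specify.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 2000

# Synthetic labels: 1 = false positive call, 0 = true variant.
is_fp = rng.random(n) < 0.2

# Hypothetical features: two quality metrics and one genomic-context flag.
qual_by_depth = np.where(is_fp, rng.normal(8, 4, n), rng.normal(18, 5, n))
mapping_quality = np.where(is_fp, rng.normal(45, 10, n), rng.normal(58, 4, n))
in_repeat_region = (rng.random(n) < np.where(is_fp, 0.6, 0.1)).astype(float)

feature_names = ["qual_by_depth", "mapping_quality", "in_repeat_region"]
X = np.column_stack([qual_by_depth, mapping_quality, in_repeat_region])
y = is_fp.astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Cost-sensitive training via class weights: mistakes on true variants
# (class 0) are weighted five times more than mistakes on FPs (class 1).
# The 5:1 ratio is an arbitrary illustrative choice.
model = RandomForestClassifier(
    n_estimators=200, class_weight={0: 5.0, 1: 1.0}, random_state=0
)
model.fit(X_train, y_train)

# Rows are actual classes, columns are predicted classes, in label order [0, 1].
kept_true, lost_true, missed_fp, caught_fp = confusion_matrix(
    y_test, model.predict(X_test), labels=[0, 1]
).ravel()
print(f"true variants kept: {kept_true}, wrongly filtered: {lost_true}")
print(f"FPs caught: {caught_fp}, missed: {missed_fp}")

# Global, impurity-based importances as a simple interpretability stand-in.
for name, importance in zip(feature_names, model.feature_importances_):
    print(f"{name}: {importance:.3f}")
```

Raising the weight on class 0 trades some FP detection for fewer discarded true variants, which is the balance the abstract describes. Re-training on data from a different protocol amounts to calling `fit` again on the new labeled set.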

AVAILABILITY AND IMPLEMENTATION

The software implementation can be found at https://github.com/ideateknoloji/FPDetect.

Similar articles

VEF: a variant filtering tool based on ensemble methods.

Bioinformatics. 2020 Apr 15;36(8):2328-2336. doi: 10.1093/bioinformatics/btz952.

GARFIELD-NGS: Genomic vARiants FIltering by dEep Learning moDels in NGS.

Bioinformatics. 2018 Sep 1;34(17):3038-3040. doi: 10.1093/bioinformatics/bty303.

Machine learning random forest for predicting oncosomatic variant NGS analysis.

Sci Rep. 2021 Nov 8;11(1):21820. doi: 10.1038/s41598-021-01253-y.

Using genotype array data to compare multi- and single-sample variant calls and improve variant call sets from deep coverage whole-genome sequencing data.

Bioinformatics. 2017 Apr 15;33(8):1147-1153. doi: 10.1093/bioinformatics/btw786.

FVC as an adaptive and accurate method for filtering variants from popular NGS analysis pipelines.

Commun Biol. 2022 Sep 16;5(1):975. doi: 10.1038/s42003-022-03397-7.

An investigation of causes of false positive single nucleotide polymorphisms using simulated reads from a small eukaryote genome.

BMC Bioinformatics. 2015 Nov 11;16:382. doi: 10.1186/s12859-015-0801-z.

tarSVM: Improving the accuracy of variant calls derived from microfluidic PCR-based targeted next generation sequencing using a support vector machine.

BMC Bioinformatics. 2016 Jun 10;17(1):233. doi: 10.1186/s12859-016-1108-4.

GLANET: genomic loci annotation and enrichment tool.

Bioinformatics. 2017 Sep 15;33(18):2818-2828. doi: 10.1093/bioinformatics/btx326.

Precise detection of de novo single nucleotide variants in human genomes.

Proc Natl Acad Sci U S A. 2018 May 22;115(21):5516-5521. doi: 10.1073/pnas.1802244115. Epub 2018 May 7.

Cited by

Enhancing Clinical Applications by Evaluation of Sensitivity and Specificity in Whole Exome Sequencing.

Int J Mol Sci. 2024 Dec 10;25(24):13250. doi: 10.3390/ijms252413250.

References

The European Nucleotide Archive in 2021.

Nucleic Acids Res. 2022 Jan 7;50(D1):D106-D110. doi: 10.1093/nar/gkab1051.

Reducing Sanger confirmation testing through false positive prediction algorithms.

Genet Med. 2021 Jul;23(7):1255-1262. doi: 10.1038/s41436-021-01148-3. Epub 2021 Mar 25.

Probability of change in life: Amino acid changes in single nucleotide substitutions.

Biosystems. 2020 Jun;193-194:104135. doi: 10.1016/j.biosystems.2020.104135. Epub 2020 Apr 4.

VEF: a variant filtering tool based on ensemble methods.

Bioinformatics. 2020 Apr 15;36(8):2328-2336. doi: 10.1093/bioinformatics/btz952.

Lean and deep models for more accurate filtering of SNP and INDEL variant calls.

Bioinformatics. 2020 Apr 1;36(7):2060-2067. doi: 10.1093/bioinformatics/btz901.

An open resource for accurately benchmarking small variant and reference calls.

Nat Biotechnol. 2019 May;37(5):561-566. doi: 10.1038/s41587-019-0074-6. Epub 2019 Apr 1.

A Rigorous Interlaboratory Examination of the Need to Confirm Next-Generation Sequencing-Detected Variants with an Orthogonal Method in Clinical Genetic Testing.

J Mol Diagn. 2019 Mar;21(2):318-329. doi: 10.1016/j.jmoldx.2018.10.009. Epub 2019 Jan 3.

A universal SNP and small-indel variant caller using deep neural networks.

Nat Biotechnol. 2018 Nov;36(10):983-987. doi: 10.1038/nbt.4235. Epub 2018 Sep 24.

GARFIELD-NGS: Genomic vARiants FIltering by dEep Learning moDels in NGS.

Bioinformatics. 2018 Sep 1;34(17):3038-3040. doi: 10.1093/bioinformatics/bty303.

A machine learning model to determine the accuracy of variant calls in capture-based next generation sequencing.

BMC Genomics. 2018 Apr 17;19(1):263. doi: 10.1186/s12864-018-4659-0.
