Improving the filtering of false positive single nucleotide variations by combining genomic features with quality metrics.

Affiliations

Department of Computer Engineering, Kocaeli University, Kocaeli 41000, Turkey.

R&D Department, Idea Technology Solutions LLC., Istanbul 34396, Turkey.

Publication information

Bioinformatics. 2023 Dec 1;39(12). doi: 10.1093/bioinformatics/btad694.

Abstract

MOTIVATION

Technical errors in sequencing or bioinformatics steps, together with alignment difficulties at some genomic sites, produce false positive (FP) variants. Filtering on quality metrics is a common way to detect FP variants, but fixed thresholds overlook more complex relationships between features, so thresholds set to lower the FP rate may also discard true positive variants. The goal of this study is to develop a machine learning-based model for identifying FPs that integrates quality metrics with genomic features and offers feature-level interpretability to provide insight into model results.

RESULTS

We propose a random forest-based model that uses genomic features to improve the identification of FPs. Further examination of the features shows that the newly introduced ones have an important impact on the prediction of variants misclassified by VEF, GATK-CNN, and GARFIELD, three recently introduced FP detection systems. We applied cost-sensitive training to guard against misclassifying true variants, yielding a model that is robust against such errors while increasing the prediction rate of FP variants. The model can easily be re-trained when factors such as experimental protocols alter the FP distribution. In addition, its interpretability mechanism lets users understand the impact of features on the model's predictions.
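The trade-off behind cost-sensitive training can be illustrated with a deliberately simplified sketch. It is not the authors' random forest: it picks a single cut-off on one quality score by minimizing a weighted misclassification cost. The data are synthetic, and the names `VARIANTS`, `total_cost`, `best_threshold`, `cost_lost_true`, and `cost_kept_fp` are hypothetical, introduced only for this illustration.

```python
from typing import List, Tuple

# Synthetic toy data (NOT from the paper): (quality_score, is_true_variant)
VARIANTS: List[Tuple[float, bool]] = [
    (5.0, False), (8.0, False), (12.0, False), (15.0, True),
    (18.0, False), (22.0, True), (25.0, True), (30.0, True),
    (35.0, True), (40.0, True),
]


def total_cost(threshold: float,
               variants: List[Tuple[float, bool]],
               cost_lost_true: float,
               cost_kept_fp: float) -> float:
    """Cost of filtering out every variant whose score is below `threshold`."""
    cost = 0.0
    for score, is_true in variants:
        filtered = score < threshold
        if filtered and is_true:
            cost += cost_lost_true   # a true variant was discarded
        elif not filtered and not is_true:
            cost += cost_kept_fp     # a false positive slipped through
    return cost


def best_threshold(variants: List[Tuple[float, bool]],
                   cost_lost_true: float,
                   cost_kept_fp: float) -> float:
    """Try each observed score (plus one value above the maximum) as a cut-off
    and return the cheapest; ties go to the lowest cut-off."""
    candidates = sorted({score for score, _ in variants})
    candidates.append(candidates[-1] + 1.0)
    return min(
        candidates,
        key=lambda t: (total_cost(t, variants, cost_lost_true, cost_kept_fp), t),
    )


# Penalizing kept FPs more heavily pushes the cut-off up and sacrifices
# the true variant at score 15.
aggressive = best_threshold(VARIANTS, cost_lost_true=1.0, cost_kept_fp=2.0)

# Penalizing lost true variants more heavily keeps the cut-off low,
# accepting one remaining FP (score 18) to protect every true variant.
conservative = best_threshold(VARIANTS, cost_lost_true=5.0, cost_kept_fp=1.0)

print(aggressive, conservative)
```

A single cut-off cannot separate the FP at score 18 from the true variant at score 15, whichever costs are chosen. That limitation of per-metric thresholds is what motivates combining several features in a learned model, with the cost weights deciding which kind of error the model prefers to avoid.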

AVAILABILITY AND IMPLEMENTATION

The software implementation can be found at https://github.com/ideateknoloji/FPDetect.
