Accurate indel prediction using paired-end short reads.

Affiliations

Machine Learning and Computational Biology Research Group, Max Planck Institute for Developmental Biology and Max Planck Institute for Intelligent Systems, Tübingen, Germany.

Publication information

BMC Genomics. 2013 Feb 27;14:132. doi: 10.1186/1471-2164-14-132.

DOI:10.1186/1471-2164-14-132

PMID:23442375

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC3614465/

Abstract

BACKGROUND

One of the major open challenges in next generation sequencing (NGS) is the accurate identification of structural variants such as insertions and deletions (indels). Current methods for indel calling assign scores to different types of evidence or counter-evidence for the presence of an indel, such as the number of split read alignments spanning the boundaries of a deletion candidate or reads that map within a putative deletion. Candidates with a score above a manually defined threshold are then predicted to be true indels. As a consequence, structural variants detected in this manner contain many false positives.
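The threshold-based calling described above can be sketched as follows. This is only an illustration under stated assumptions: the `DeletionCandidate` fields, the linear `score` function, its weights, and the threshold value are all hypothetical and are not taken from any specific indel caller.

```python
from dataclasses import dataclass


@dataclass
class DeletionCandidate:
    """Hypothetical summary of one deletion candidate."""
    name: str
    split_reads_spanning: int  # evidence: split reads spanning the boundaries
    reads_inside: int          # counter-evidence: reads mapping within the deletion


def score(c, w_evidence=1.0, w_counter=1.0):
    # Assumed scoring rule: evidence adds to the score, counter-evidence subtracts.
    return w_evidence * c.split_reads_spanning - w_counter * c.reads_inside


def call_by_threshold(candidates, threshold):
    # Every candidate scoring above the manually chosen threshold is
    # predicted to be a true indel; nothing is learned from validated data.
    return [c for c in candidates if score(c) > threshold]


candidates = [
    DeletionCandidate("cand_a", split_reads_spanning=8, reads_inside=0),
    DeletionCandidate("cand_b", split_reads_spanning=3, reads_inside=2),
    DeletionCandidate("cand_c", split_reads_spanning=5, reads_inside=1),
]
called = call_by_threshold(candidates, threshold=2.0)
called_names = [c.name for c in called]
```

The outcome depends entirely on the hand-picked threshold and weights, which is the weakness the abstract points to.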

RESULTS

Here, we present a machine-learning-based method that discovers indel candidates and distinguishes true from false ones in order to reduce the false positive rate. Our method identifies indel candidates using a discriminative classifier based on features of split read alignment profiles and trained on true and false indel candidates that were validated by Sanger sequencing. We demonstrate the usefulness of our method with paired-end Illumina reads from 80 genomes of the first phase of the 1001 Genomes Project (http://www.1001genomes.org) in Arabidopsis thaliana.
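To make the idea of training a discriminative classifier on validated candidates concrete, here is a minimal sketch. The abstract does not specify the classifier or the exact features, so this example substitutes a plain logistic regression, two hypothetical features (split reads spanning the breakpoints, reads mapped inside the candidate), and made-up toy labels standing in for validation results. It is not the authors' implementation.

```python
import math


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_logistic(features, labels, lr=0.05, epochs=2000):
    """Fit logistic regression by batch gradient descent (stand-in classifier)."""
    n = len(features)
    n_features = len(features[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(features, labels):
            p = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
            err = p - y  # gradient of the log-loss with respect to the logit
            for j, xi in enumerate(x):
                grad_w[j] += err * xi
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def predict(weights, bias, x):
    # True means "predicted true indel".
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) >= 0.5


# Toy, made-up training set: [split reads spanning breakpoints, reads inside].
# Label 1 = candidate confirmed by validation, 0 = candidate rejected.
train_x = [[8, 0], [6, 1], [7, 0], [5, 1], [2, 4], [1, 6], [3, 5], [2, 3]]
train_y = [1, 1, 1, 1, 0, 0, 0, 0]

weights, bias = train_logistic(train_x, train_y)
train_predictions = [int(predict(weights, bias, x)) for x in train_x]
```

Unlike a manual cutoff, the decision boundary here is learned from labeled examples, so the relative weight of evidence and counter-evidence is set by the validated data rather than by hand.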

CONCLUSION

In this work, we show that indel classification is a necessary step to reduce the number of false positive candidates. We demonstrate that omitting classification may lead to spurious biological interpretations. The software is available at: http://agkb.is.tuebingen.mpg.de/Forschung/SV-M/.

Similar articles

The challenge of detecting indels in bacterial genomes from short-read sequencing data.

J Biotechnol. 2017 May 20;250:11-15. doi: 10.1016/j.jbiotec.2017.02.026. Epub 2017 Mar 4.

Dindel: accurate indel calls from short-read data.

Genome Res. 2011 Jun;21(6):961-73. doi: 10.1101/gr.112326.110. Epub 2010 Oct 27.

SInC: an accurate and fast error-model based simulator for SNPs, Indels and CNVs coupled with a read generator for short-read sequence data.

BMC Bioinformatics. 2014 Feb 5;15:40. doi: 10.1186/1471-2105-15-40.

mInDel: a high-throughput and efficient pipeline for genome-wide InDel marker development.

BMC Genomics. 2016 Apr 14;17:290. doi: 10.1186/s12864-016-2614-5.

ARAMIS: From systematic errors of NGS long reads to accurate assemblies.

Brief Bioinform. 2021 Nov 5;22(6). doi: 10.1093/bib/bbab170.

Amplicon Indel Hunter Is a Novel Bioinformatics Tool to Detect Large Somatic Insertion/Deletion Mutations in Amplicon-Based Next-Generation Sequencing Data.

J Mol Diagn. 2015 Nov;17(6):635-43. doi: 10.1016/j.jmoldx.2015.06.005. Epub 2015 Aug 28.

Vindel: a simple pipeline for checking indel redundancy.

BMC Bioinformatics. 2014 Nov 19;15(1):359. doi: 10.1186/s12859-014-0359-1.

Indel detection from RNA-seq data: tool evaluation and strategies for accurate detection of actionable mutations.

Brief Bioinform. 2017 Nov 1;18(6):973-983. doi: 10.1093/bib/bbw069.

Comparative assessments of indel annotations in healthy and cancer genomes with next-generation sequencing data.

BMC Med Genomics. 2020 Nov 10;13(1):170. doi: 10.1186/s12920-020-00818-6.

Cited by

Optimizing Insertion and Deletion Detection Using Next-Generation Sequencing in the Clinical Laboratory.

J Mol Diagn. 2022 Dec;24(12):1217-1231. doi: 10.1016/j.jmoldx.2022.08.006. Epub 2022 Sep 24.

Comprehensive evaluation and characterisation of short read general-purpose structural variant calling software.

Nat Commun. 2019 Jul 19;10(1):3240. doi: 10.1038/s41467-019-11146-4.

A Comprehensive Study of De Novo Genome Assemblers: Current Challenges and Future Prospective.

Evol Bioinform Online. 2018 Feb 20;14:1176934318758650. doi: 10.1177/1176934318758650. eCollection 2018.

InDel marker detection by integration of multiple softwares using machine learning techniques.

BMC Bioinformatics. 2016 Nov 2;17(1):548. doi: 10.1186/s12859-016-1312-2.

SPAI: an interactive platform for indel analysis.

BMC Genomics. 2016 Aug 31;17 Suppl 5(Suppl 5):496. doi: 10.1186/s12864-016-2824-x.

Performance evaluation of indel calling tools using real short-read data.

Hum Genomics. 2015 Aug 19;9(1):20. doi: 10.1186/s40246-015-0042-2.

The pattern of DNA cleavage intensity around indels.

Sci Rep. 2015 Feb 9;5:8333. doi: 10.1038/srep08333.

MIPSTR: a method for multiplex genotyping of germline and somatic STR variation across many individuals.

Genome Res. 2015 May;25(5):750-61. doi: 10.1101/gr.182212.114. Epub 2015 Feb 6.

Are differences in genomic data sets due to true biological variants or errors in genome assembly: an example from two chloroplast genomes.

PLoS One. 2015 Feb 6;10(2):e0118019. doi: 10.1371/journal.pone.0118019. eCollection 2015.

Century-scale methylome stability in a recently diverged Arabidopsis thaliana lineage.

PLoS Genet. 2015 Jan 8;11(1):e1004920. doi: 10.1371/journal.pgen.1004920. eCollection 2015 Jan.
