An investigation of causes of false positive single nucleotide polymorphisms using simulated reads from a small eukaryote genome.

Author information

Ribeiro Antonio, Golicz Agnieszka, Hackett Christine Anne, Milne Iain, Stephen Gordon, Marshall David, Flavell Andrew J, Bayer Micha

Affiliations

The James Hutton Institute, Invergowrie, Dundee, DD2 5DA, Scotland, UK.

Division of Plant Sciences, University of Dundee at JHI, Invergowrie, Dundee, DD2 5DA, Scotland, UK.

Publication information

BMC Bioinformatics. 2015 Nov 11;16:382. doi: 10.1186/s12859-015-0801-z.

DOI:10.1186/s12859-015-0801-z

PMID:26558718

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC4642669/

Abstract

BACKGROUND

Single Nucleotide Polymorphisms (SNPs) are widely used molecular markers, and their use has increased massively since the inception of Next Generation Sequencing (NGS) technologies, which allow detection of large numbers of SNPs at low cost. However, both NGS data and their analysis are error-prone, which can lead to the generation of false positive (FP) SNPs. We explored the relationship between FP SNPs and seven factors involved in mapping-based variant calling - quality of the reference sequence, read length, choice of mapper and variant caller, mapping stringency and filtering of SNPs by read mapping quality and read depth. This resulted in 576 possible factor level combinations. We used error- and variant-free simulated reads to ensure that every SNP found was indeed a false positive.
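Two of the seven factors are post-calling filters, on read mapping quality and on read depth. The sketch below illustrates what such a filter might look like. The `SnpCall` record, the `filter_snps` function, and the threshold values are hypothetical and are not taken from the study. Because the simulated reads contain no errors or variants, any call that survives such a filter would still be a false positive.

```python
from dataclasses import dataclass


@dataclass
class SnpCall:
    """Hypothetical minimal record for one called SNP."""
    chrom: str
    pos: int
    mapq: float  # mean mapping quality of the reads covering the site
    depth: int   # number of reads covering the site


def filter_snps(calls, min_mapq=20.0, min_depth=5):
    """Keep only calls that meet both thresholds.

    The default thresholds are illustrative assumptions,
    not the values used in the paper.
    """
    return [c for c in calls if c.mapq >= min_mapq and c.depth >= min_depth]


calls = [
    SnpCall("chr1", 1050, mapq=60.0, depth=30),  # passes both filters
    SnpCall("chr1", 2210, mapq=3.0, depth=28),   # fails mapping quality
    SnpCall("chr2", 77, mapq=42.0, depth=2),     # fails read depth
]

kept = filter_snps(calls)
print([(c.chrom, c.pos) for c in kept])
```

Setting both thresholds to zero disables filtering, which corresponds to the unfiltered condition in this sketch.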

RESULTS

The variation in the number of FP SNPs generated ranged from 0 to 36,621 for the 120 million base pairs (Mbp) genome. All of the experimental factors tested had statistically significant effects on the number of FP SNPs generated and there was a considerable amount of interaction between the different factors. Using a fragmented reference sequence led to a dramatic increase in the number of FP SNPs generated, as did relaxed read mapping and a lack of SNP filtering. The choice of reference assembler, mapper and variant caller also significantly affected the outcome. The effect of read length was more complex and suggests a possible interaction between mapping specificity and the potential for contributing more false positives as read length increases.
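To put the upper bound in perspective, the reported maximum can be converted into a density. The short calculation below uses only the two figures quoted above (36,621 FP SNPs and a 120 Mbp genome); the function and variable names are illustrative. The worst case works out to roughly 305 FP SNPs per Mbp.

```python
GENOME_SIZE_MBP = 120   # genome size quoted in the abstract
MAX_FP_SNPS = 36_621    # largest FP SNP count quoted in the abstract


def fp_per_mbp(fp_count, genome_mbp):
    """False positive SNPs per million base pairs."""
    return fp_count / genome_mbp


worst = fp_per_mbp(MAX_FP_SNPS, GENOME_SIZE_MBP)
spacing_bp = GENOME_SIZE_MBP * 1_000_000 / MAX_FP_SNPS

print(f"{worst:.1f} FP SNPs per Mbp")
print(f"about one FP SNP every {spacing_bp:.0f} bp")
```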

CONCLUSIONS

The choice of tools and parameters involved in variant calling can have a dramatic effect on the number of FP SNPs produced, with particularly poor combinations of software and/or parameter settings yielding tens of thousands in this experiment. Between-factor interactions make simple recommendations difficult for a SNP discovery pipeline but the quality of the reference sequence is clearly of paramount importance. Our findings are also a stark reminder that it can be unwise to use the relaxed mismatch settings provided as defaults by some read mappers when reads are being mapped to a relatively unfinished reference sequence from e.g. a non-model organism in its early stages of genomic exploration.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f891/4642669/e907762b3626/12859_2015_801_Fig1_HTML.jpg

Similar articles

Annotation-based genome-wide SNP discovery in the large and complex Aegilops tauschii genome using next-generation sequencing without a reference genome sequence.

BMC Genomics. 2011 Jan 25;12:59. doi: 10.1186/1471-2164-12-59.

Specificity control for read alignments using an artificial reference genome-guided false discovery rate.

Bioinformatics. 2014 Jan 1;30(1):9-16. doi: 10.1093/bioinformatics/btt255. Epub 2013 May 17.

Whole-Genome Sequence Accuracy Is Improved by Replication in a Population of Mutagenized Sorghum.

G3 (Bethesda). 2018 Mar 2;8(3):1079-1094. doi: 10.1534/g3.117.300301.

Improving mapping and SNP-calling performance in multiplexed targeted next-generation sequencing.

BMC Genomics. 2012 Aug 22;13:417. doi: 10.1186/1471-2164-13-417.

Read trimming has minimal effect on bacterial SNP-calling accuracy.

Microb Genom. 2020 Dec;6(12). doi: 10.1099/mgen.0.000434. Epub 2020 Dec 11.

A high-throughput SNP discovery strategy for RNA-seq data.

BMC Genomics. 2019 Feb 27;20(1):160. doi: 10.1186/s12864-019-5533-4.

Coverage-based consensus calling (CbCC) of short sequence reads and comparison of CbCC results to identify SNPs in chickpea (Cicer arietinum; Fabaceae), a crop species without a reference genome.

Am J Bot. 2012 Feb;99(2):186-92. doi: 10.3732/ajb.1100419. Epub 2012 Feb 1.

Choice of reference sequence and assembler for alignment of Listeria monocytogenes short-read sequence data greatly influences rates of error in SNP analyses.

PLoS One. 2014 Aug 21;9(8):e104579. doi: 10.1371/journal.pone.0104579. eCollection 2014.

Evaluation and assessment of read-mapping by multiple next-generation sequencing aligners based on genome-wide characteristics.

Genomics. 2017 Jul;109(3-4):186-191. doi: 10.1016/j.ygeno.2017.03.001. Epub 2017 Mar 9.

Cited by

F-Based Marker Prioritization Within Quantitative Trait Loci Regions and Its Impact on Genomic Selection Accuracy: Insights from a Simulation Study with High-Density Marker Panels for Bovines.

Genes (Basel). 2025 May 10;16(5):563. doi: 10.3390/genes16050563.

Whole-genome comparison using complete genomes from strains revealed single nucleotide polymorphisms on non-genomic islands for subspecies differentiation.

Front Microbiol. 2024 Sep 12;15:1452564. doi: 10.3389/fmicb.2024.1452564. eCollection 2024.

Uncovering Footprints of Natural Selection Through Spectral Analysis of Genomic Summary Statistics.

Mol Biol Evol. 2023 Jul 5;40(7). doi: 10.1093/molbev/msad157.

SNP4OrphanSpecies: A bioinformatics pipeline to isolate molecular markers for studying genetic diversity of orphan species.

Biodivers Data J. 2022 Aug 24;10:e85587. doi: 10.3897/BDJ.10.e85587. eCollection 2022.

A unique Toxoplasma gondii haplotype accompanied the global expansion of cats.

Nat Commun. 2022 Oct 1;13(1):5778. doi: 10.1038/s41467-022-33556-7.

The evolutionary patterns of barley pericentromeric chromosome regions, as shaped by linkage disequilibrium and domestication.

Plant J. 2022 Sep;111(6):1580-1594. doi: 10.1111/tpj.15908. Epub 2022 Aug 9.

Sequences to Differences in Gene Expression: Analysis of RNA-Seq Data.

Methods Mol Biol. 2022;2508:279-318. doi: 10.1007/978-1-0716-2376-3_20.

Generalizable characteristics of false-positive bacterial variant calls.

Microb Genom. 2021 Aug;7(8). doi: 10.1099/mgen.0.000615.

DEEPGEN-A Novel Variant Calling Assay for Low Frequency Variants.

Genes (Basel). 2021 Mar 30;12(4):507. doi: 10.3390/genes12040507.

Comparative Transcriptomics and RNA-Seq-Based Bulked Segregant Analysis Reveals Genomic Basis Underlying Virulence.

Front Microbiol. 2021 Feb 22;12:602812. doi: 10.3389/fmicb.2021.602812. eCollection 2021.
