Automated correction of genome sequence errors.

Authors

Gajer Pawel, Schatz Michael, Salzberg Steven L

Affiliation

The Institute for Genomic Research, 9712 Medical Center Drive, Rockville, MD 20850, USA.

Publication details

Nucleic Acids Res. 2004 Jan 26;32(2):562-9. doi: 10.1093/nar/gkh216. Print 2004.

DOI:10.1093/nar/gkh216

PMID:14744981

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC373340/

Abstract

By using information from an assembly of a genome, a new program called AutoEditor significantly improves base calling accuracy over that achieved by previous algorithms. This in turn improves the overall accuracy of genome sequences and facilitates the use of these sequences for polymorphism discovery. We describe the algorithm and its application in a large set of recent genome sequencing projects. The number of erroneous base calls in these projects was reduced by 80%. In an analysis of over one million corrections, we found that AutoEditor made just one error per 8828 corrections. By substantially increasing the accuracy of base calling, AutoEditor can dramatically accelerate the process of finishing genomes, which involves closing all gaps and ensuring minimum quality standards for the final sequence. It also greatly improves our ability to discover single nucleotide polymorphisms (SNPs) between closely related strains and isolates of the same species.
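
To put the reported figure of one error per 8828 corrections in perspective, the short sketch below converts it into a per-correction error rate and, purely as an interpretive aid, into a Phred-style score using the standard relation Q = -10 * log10(p). The function names are hypothetical, and the Phred-scale comparison is an illustration added here, not a figure stated in the abstract.

```python
import math


def correction_error_rate(corrections_per_error: float) -> float:
    """Fraction of corrections that are wrong, given one error per N corrections."""
    if corrections_per_error <= 0:
        raise ValueError("corrections_per_error must be positive")
    return 1.0 / corrections_per_error


def phred_scaled(error_rate: float) -> float:
    """Phred-style score for an error probability p: Q = -10 * log10(p)."""
    if not 0.0 < error_rate <= 1.0:
        raise ValueError("error_rate must be in (0, 1]")
    return -10.0 * math.log10(error_rate)


# 8828 is the figure reported in the abstract; everything else is derived from it.
rate = correction_error_rate(8828)
print(f"error rate per correction: {rate:.4%}")
print(f"Phred-scaled equivalent:   Q{phred_scaled(rate):.1f}")
```

Under this reading, an individual AutoEditor correction is wrong roughly once in ten thousand times, which is close to the reliability conventionally associated with a Q40 base call.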
