Basecalling with LifeTrace.

Author information

Walther D, Bartha G, Morris M

Affiliation

Incyte Genomics, Inc., Palo Alto, California 94304, USA.

Publication information

Genome Res. 2001 May;11(5):875-88. doi: 10.1101/gr.177901.

DOI:10.1101/gr.177901

PMID:11337481

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC311100/

Abstract

A pivotal step in electrophoresis sequencing is the conversion of the raw, continuous chromatogram data into the actual sequence of discrete nucleotides, a process referred to as basecalling. We describe a novel algorithm for basecalling implemented in the program LifeTrace. Like Phred, currently the most widely used basecalling software program, LifeTrace takes processed trace data as input. It was designed to be tolerant of variable peak spacing by means of an improved peak-detection algorithm that emphasizes local chromatogram information over global properties. LifeTrace is shown to generate high-quality basecalls and reliable quality scores. It proved particularly effective when applied to MegaBACE capillary sequencing machines. In a benchmark test of 8372 dye-primer MegaBACE chromatograms, LifeTrace generated 17% fewer substitution errors, 16% fewer insertion/deletion errors, and 2.4% more bases aligned to the finished sequence than did Phred. For two sets totaling 6624 dye-terminator chromatograms, the performance improvement was 15% fewer substitution errors, 10% fewer insertion/deletion errors, and 2.1% more aligned bases. The processing time required by LifeTrace is comparable to that of Phred. The predicted quality scores were in line with observed quality scores, permitting direct use for quality clipping and in silico single nucleotide polymorphism (SNP) detection. Furthermore, we introduce a new type of quality score associated with every basecall: the gap-quality. It estimates the probability of a deletion error between the current and the following basecall. This additional quality score improves detection of single-basepair deletions when used for locating potential basecalling errors during the alignment. We also describe a new protocol for benchmarking that we believe better discerns basecaller performance differences than previously published methods.
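
The role of per-base quality scores in quality clipping can be illustrated with a small sketch. It assumes the Phred convention Q = -10 log10(p) for relating a score to an error probability, and it uses a maximum-scoring-segment rule with an assumed error cutoff of 0.05 as one common way to clip a read. The abstract does not state which clipping rule was used with LifeTrace, so this is not the authors' procedure. The function names, the cutoff, the threshold, and the treatment of gap-quality as a value on the same logarithmic scale are all assumptions made for illustration.

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Error probability implied by a Phred-scaled quality score."""
    return 10.0 ** (-q / 10.0)


def error_prob_to_phred(p: float) -> float:
    """Phred-scaled quality score for an error probability p (0 < p <= 1)."""
    return -10.0 * math.log10(p)


def quality_clip(quals, cutoff=0.05):
    """Return the half-open range (start, end) of bases to keep.

    Each base contributes (cutoff - error probability); the segment with
    the largest positive sum is kept. The cutoff of 0.05 is an assumed
    default, not a value taken from the abstract.
    """
    best_sum, best_range = 0.0, (0, 0)
    run_sum, run_start = 0.0, 0
    for i, q in enumerate(quals):
        run_sum += cutoff - phred_to_error_prob(q)
        if run_sum <= 0.0:
            # The running segment is no longer worth extending; restart.
            run_sum, run_start = 0.0, i + 1
        elif run_sum > best_sum:
            best_sum, best_range = run_sum, (run_start, i + 1)
    return best_range


def low_gap_quality_sites(gap_quals, min_q=15):
    """Indices i where a deletion between basecall i and i + 1 is plausible.

    Hypothetical helper: assumes gap-quality values are Phred-scaled and
    that a fixed threshold is a reasonable way to flag them.
    """
    return [i for i, g in enumerate(gap_quals) if g < min_q]


if __name__ == "__main__":
    quals = [5, 8, 30, 35, 40, 32, 9, 6]
    start, end = quality_clip(quals)
    print("kept bases:", start, "to", end - 1)
    print("possible deletion sites:", low_gap_quality_sites([40, 12, 35, 8]))
```

For example, `quality_clip([5, 8, 30, 35, 40, 32, 9, 6])` keeps the four high-quality bases at indices 2 through 5 and discards the low-quality bases at both ends.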
