Tang Jifeng, Vosman Ben, Voorrips Roeland E, van der Linden C Gerard, Leunissen Jack A M
Plant Research International, PO Box 16, 6700 AA Wageningen, The Netherlands.
BMC Bioinformatics. 2006 Oct 9;7:438. doi: 10.1186/1471-2105-7-438.
Single nucleotide polymorphisms (SNPs) are important tools in studying complex genetic traits and genome evolution. Computational strategies for SNP discovery make use of the large number of sequences present in public databases (in most cases as expressed sequence tags (ESTs)) and are considered to be faster and more cost-effective than experimental procedures. A major challenge in computational SNP discovery is distinguishing allelic variation from sequence variation between paralogous sequences, in addition to recognizing sequencing errors. For the majority of public EST sequences, trace or quality files are lacking, which makes the detection of reliable SNPs even more difficult because it has to rely on sequence comparisons alone.
We have developed a new algorithm to detect reliable SNPs and insertions/deletions (indels) in EST data, both with and without quality files. Implemented in a pipeline called QualitySNP, it uses three filters for the identification of reliable SNPs. Filter 1 screens for all potential SNPs and identifies variation between or within genotypes. Filter 2 is the core filter that uses a haplotype-based strategy to detect reliable SNPs. Clusters with potential paralogs as well as false SNPs caused by sequencing errors are identified. Filter 3 screens SNPs by calculating a confidence score, based upon sequence redundancy and quality. Non-synonymous SNPs are subsequently identified by detecting open reading frames of consensus sequences (contigs) with SNPs. The pipeline includes a data storage and retrieval system for haplotypes, SNPs and alignments. QualitySNP's versatility is demonstrated by the identification of SNPs in EST datasets from potato, chicken and humans.
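To make the first screening step concrete, the idea of scanning aligned ESTs for candidate variant positions might be sketched as follows. This is an illustrative sketch only, not QualitySNP's implementation: the function name `potential_snps`, the `min_minor_count` threshold, and the rule that each allele must be seen in at least two reads are assumptions introduced here, and the input is assumed to be equal-length rows of an existing multiple alignment with `-` marking gaps.

```python
from collections import Counter


def potential_snps(aligned_reads, min_minor_count=2):
    """Return (column, allele_counts) for candidate variant columns.

    Hypothetical helper: a column is reported when at least two different
    alleles are each supported by `min_minor_count` or more reads. The gap
    character '-' is counted as an allele so that indels are reported too.
    """
    if not aligned_reads:
        return []
    length = len(aligned_reads[0])
    if any(len(read) != length for read in aligned_reads):
        raise ValueError("all aligned reads must have the same length")

    hits = []
    for col in range(length):
        counts = Counter(read[col].upper() for read in aligned_reads)
        counts.pop("N", None)  # ignore ambiguous base calls
        # Keep only alleles with enough support; singletons are treated
        # here as likely sequencing errors (an assumption of this sketch).
        supported = {a: n for a, n in counts.items() if n >= min_minor_count}
        if len(supported) >= 2:
            hits.append((col, supported))
    return hits


if __name__ == "__main__":
    reads = ["ACGTA", "ACGTA", "ATGTA", "ATGTC"]
    for column, alleles in potential_snps(reads):
        print(column, alleles)
```

In this toy alignment only column 1 is reported, because the single `C` in the last column is seen in one read only. A haplotype-based step such as the one described for Filter 2 would then have to decide whether the reported columns reflect alleles or paralogs, which this sketch does not attempt.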
QualitySNP is an efficient tool for SNP detection, storage and retrieval in diploid as well as polyploid species. It is available for running on Linux or UNIX systems. The program, test data, and user manual are available at http://www.bioinformatics.nl/tools/snpweb/ and as Additional files.