Simultaneous identification of long similar substrings in large sets of sequences.

Author information

Kleffe Jürgen, Möller Friedrich, Wittig Burghardt

Affiliations

Institut für Molekularbiologie und Bioinformatik, Charite-Universitätsmedizin Berlin, Berlin, Germany.

Publication information

BMC Bioinformatics. 2007 May 24;8 Suppl 5(Suppl 5):S7. doi: 10.1186/1471-2105-8-S5-S7.

Abstract

BACKGROUND

Sequence comparison faces new challenges today, now that many complete genomes and large libraries of transcripts are known. Gene annotation pipelines match these sequences in order to identify genes and their alternative splice forms. However, the software currently available cannot simultaneously compare sets of sequences as large as necessary, especially if errors must be considered.

RESULTS

We therefore present a new algorithm for the identification of almost perfectly matching substrings in very large sets of sequences. Its implementation, called ClustDB, is considerably faster and can handle 16 times more data than VMATCH, the most memory-efficient exact program known today. ClustDB simultaneously generates large sets of exactly matching substrings of a given minimum length as seeds for a novel method of match extension with errors. It generates alignments of maximum length in which each overlapping window of a given size contains no more than a specified maximum number of errors. Such alignments are not optimal in the usual sense but are faster to calculate and often more appropriate than traditional alignments for genomic sequence comparisons, EST and full-length cDNA matching, and genomic sequence assembly. The method is used to check the overlaps and to reveal possible assembly errors for 1377 Medicago truncatula BAC-size sequences published at http://www.medicago.org/genome/assembly_table.php?chr=1.
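
The seed-and-extend idea described above can be illustrated with a minimal Python sketch. It is not ClustDB's implementation: the function names `exact_seeds` and `extend_right` are hypothetical, the seed index is a plain k-mer dictionary rather than whatever data structure ClustDB uses, and the extension is simplified to ungapped (substitution-only) matching. It only shows what it means to extend a seed as far as possible while every window of a given size holds at most a given number of errors.

```python
from collections import defaultdict, deque


def exact_seeds(seqs, k):
    """Index every length-k substring and keep those occurring more than once.

    Assumption: a simple k-mer dictionary stands in for seed generation.
    Returns {kmer: [(sequence_index, offset), ...]}.
    """
    index = defaultdict(list)
    for sid, s in enumerate(seqs):
        for i in range(len(s) - k + 1):
            index[s[i:i + k]].append((sid, i))
    return {kmer: hits for kmer, hits in index.items() if len(hits) > 1}


def extend_right(a, b, i, j, window, max_err):
    """Extend an ungapped match of a[i:] against b[j:] to the right.

    Columns are accepted as long as every run of `window` consecutive
    columns contains at most `max_err` mismatches. Trailing mismatches
    are trimmed. Returns the number of accepted columns.
    """
    recent = deque()  # mismatch flags of the most recent `window` columns
    errs = 0          # number of mismatches currently inside the window
    n = 0             # columns accepted so far
    limit = min(len(a) - i, len(b) - j)
    while n < limit:
        miss = a[i + n] != b[j + n]
        recent.append(miss)
        errs += miss
        if len(recent) > window:
            errs -= recent.popleft()
        if errs > max_err:
            break  # this column would violate the window constraint
        n += 1
    while n > 0 and a[i + n - 1] != b[j + n - 1]:
        n -= 1  # do not end the alignment on a mismatch
    return n
```

With `window=4` and `max_err=1`, a single isolated substitution is tolerated and the extension continues, whereas a run of consecutive mismatches stops the extension at the last matching column before the run.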

CONCLUSION

The program ClustDB proves that window alignment is an efficient way to find long sequence sections of homogeneous alignment quality, as expected in the case of random errors, and to detect systematic errors resulting from sequence contaminations. Such inserts are systematically overlooked in long alignments controlled only by tuning penalties for mismatches and gaps. ClustDB is freely available for academic use.
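
Why a window criterion exposes a contaminating insert that a whole-alignment measure can hide is easy to see in a toy case. The sketch below is an illustration under stated assumptions, not the authors' method: it compares two equal-length, ungapped sequences, and the helper names `overall_identity` and `worst_window` are hypothetical.

```python
def overall_identity(a, b):
    """Fraction of matching columns over the compared (ungapped) length."""
    n = min(len(a), len(b))
    return sum(x == y for x, y in zip(a, b)) / n if n else 0.0


def worst_window(a, b, window):
    """Return (max mismatches in any window, start of the first such window)."""
    flags = [x != y for x, y in zip(a, b)]
    window = min(window, len(flags))
    cur = sum(flags[:window])
    best, best_start = cur, 0
    for s in range(1, len(flags) - window + 1):
        # slide by one column: add the entering flag, drop the leaving one
        cur += flags[s + window - 1] - flags[s - 1]
        if cur > best:
            best, best_start = cur, s
    return best, best_start


# A 1000-column comparison with a 20-column foreign block in the middle.
ref = "ACGT" * 250
contaminated = ref[:500] + "N" * 20 + ref[520:]

identity = overall_identity(ref, contaminated)
errors, start = worst_window(ref, contaminated, 20)
```

Here the overall identity is 0.98, which looks like an excellent match, while the worst 20-column window consists entirely of mismatches and starts exactly at the inserted block. A limit on errors per window would therefore end the alignment at the insert instead of bridging it.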
