PTPan--overcoming memory limitations in oligonucleotide string matching for primer/probe design.

Affiliation

Department of Informatics, Technische Universität München, Boltzmannstrasse 3, 85748 Garching, Germany.

Publication information

Bioinformatics. 2011 Oct 15;27(20):2797-805. doi: 10.1093/bioinformatics/btr483. Epub 2011 Aug 19.

Abstract

MOTIVATION

Nucleic acid diagnostics places high demands on non-heuristic exact and approximate oligonucleotide string matching for in silico primer/probe design in huge nucleic acid sequence collections. Unfortunately, public sequence repositories grow much faster than computer hardware performance and main memory capacity. This growth poses severe problems for existing oligonucleotide primer/probe design applications and calls for new approaches based on space-efficient indexing structures.

RESULTS

We developed PTPan (pronounced 'Peter Pan'; 'PT' stands for Position Tree, an earlier name for suffix trees), a space-efficient indexing structure for approximate oligonucleotide string matching in nucleic acid sequence data. Based on suffix trees, it combines partitioning, truncation and a new suffix tree stream compression to deal with large amounts of aligned and unaligned data. PTPan operates efficiently in main memory and on secondary storage, balancing memory consumption against runtime during construction and application. Based on PTPan, applications supporting similarity search and primer/probe design have been implemented, namely FindFamily, ProbeMatch and ProbeDesign. All three use a weighted Levenshtein distance metric for approximate queries to find and rate matches with indels as well as substitutions. We integrated PTPan into the widely used software package ARB to demonstrate usability and performance. Comparing PTPan with the original ARB index on the very large ssu-rRNA database SILVA, we observed a shorter construction time, extended functionality and dramatically reduced memory requirements, at the price of longer but still very reasonable query times. PTPan enables indexing of huge nucleic acid sequence collections at reasonable application response times. Not being limited by main memory, PTPan constitutes a major advancement in rapid oligonucleotide string matching for primer/probe design, now and in the future, given the enormous growth of molecular sequence data.
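
To make the weighted Levenshtein distance mentioned above concrete, here is a minimal sketch in Python. It is a plain dynamic-programming version, not PTPan's suffix-tree-based search, and the function name `weighted_levenshtein` and the cost values `sub_cost` and `indel_cost` are illustrative assumptions; the abstract does not state the weights PTPan actually uses.

```python
def weighted_levenshtein(query, target, sub_cost=1.0, indel_cost=1.5):
    """Weighted edit distance between two oligonucleotide strings.

    sub_cost and indel_cost are hypothetical weights chosen for
    illustration only; they are not taken from PTPan.
    """
    # prev[j] = cost of turning the empty prefix of query into target[:j]
    prev = [j * indel_cost for j in range(len(target) + 1)]
    for i, q in enumerate(query, start=1):
        # First column: delete all i characters of query[:i]
        curr = [i * indel_cost]
        for j, t in enumerate(target, start=1):
            diagonal = prev[j - 1] + (0.0 if q == t else sub_cost)  # match or substitution
            deletion = prev[j] + indel_cost                          # drop q from the query
            insertion = curr[j - 1] + indel_cost                     # add t to the query
            curr.append(min(diagonal, deletion, insertion))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    probe = "ACGT"
    for candidate in ("ACGT", "AGGT", "ACT"):
        print(candidate, weighted_levenshtein(probe, candidate))
```

Under these assumed weights an exact match scores 0, a single substitution scores `sub_cost`, and a single insertion or deletion scores `indel_cost`, so candidate matches can be ranked by how costly their mismatches are.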

AVAILABILITY

Supplementary Material, the PTPan stand-alone library and the ARB-PTPan binary are available at http://ptpan.lrr.in.tum.de/.

CONTACT

meierh@in.tum.de

SUPPLEMENTARY INFORMATION

Supplementary data are available at Bioinformatics online.
