Comprehensive and relaxed search for oligonucleotide signatures in hierarchically clustered sequence datasets.

Affiliation

Services Department of Informatics, Technische Universität München, Boltzmannstrasse 3, 85748 Garching, Germany.

Publication information

Bioinformatics. 2011 Jun 1;27(11):1546-54. doi: 10.1093/bioinformatics/btr161. Epub 2011 Apr 5.

DOI:10.1093/bioinformatics/btr161

PMID:21471017

Abstract

MOTIVATION

PCR, hybridization, DNA sequencing and other important methods in molecular diagnostics rely on both sequence-specific and sequence group-specific oligonucleotide primers and probes. Their design depends on the identification of oligonucleotide signatures in whole genome or marker gene sequences. Although genome and gene databases are generally available and regularly updated, collections of valuable signatures are rare. Even for single requests, the search for signatures becomes computationally expensive when working with large collections of target (and non-target) sequences. Moreover, with growing dataset sizes, the chance of finding exact group-matching signatures decreases, necessitating the application of relaxed search methods. The resultant substantial increase in complexity is exacerbated by the dearth of algorithms able to solve these problems efficiently.

RESULTS

We have developed CaSSiS, a fast and scalable method for computing comprehensive collections of sequence- and sequence group-specific oligonucleotide signatures from large sets of hierarchically clustered nucleic acid sequence data. Based on the ARB Positional Tree (PT-)Server and a newly developed BGRT data structure, CaSSiS not only determines sequence-specific signatures and perfect group-covering signatures for every node within the cluster (i.e. target groups), but also signatures with maximal group coverage (sensitivity) within a user-defined range of non-target hits (specificity) for groups lacking a perfect common signature. An upper limit of tolerated mismatches within the target group, as well as the minimum number of mismatches with non-target sequences, can be predefined. Test runs with one of the largest phylogenetic gene sequence datasets available indicate good runtime and memory performance, and in silico spot tests have shown the usefulness of the resulting signature sequences as blueprints for group-specific oligonucleotide probes.
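To make the selection criterion concrete, the following is a minimal brute-force sketch in Python. It is not the CaSSiS implementation: it does not use the ARB PT-Server or the BGRT data structure, it handles exact matches only (no mismatch tolerance), and it would not scale to large datasets. The function names (`kmers`, `best_signatures`), the parameter `max_outgroup_hits` and the toy sequences are all hypothetical and chosen for illustration. For one target group, it returns the fixed-length oligonucleotides that cover the most group members while hitting no more than a user-defined number of non-target sequences.

```python
from collections import defaultdict


def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def best_signatures(sequences, group, k, max_outgroup_hits=0):
    """Find the k-mers with maximal coverage of `group`.

    sequences: dict mapping sequence name -> nucleotide string
    group: names of the target sequences (one node of the hierarchy)
    max_outgroup_hits: tolerated number of matched non-target sequences
    Returns (coverage, sorted list of k-mers reaching that coverage).
    """
    # Index: k-mer -> names of all sequences that contain it exactly.
    hits = defaultdict(set)
    for name, seq in sequences.items():
        for mer in kmers(seq, k):
            hits[mer].add(name)

    group = set(group)
    best_cov, best = 0, []
    for mer, names in hits.items():
        ingroup = len(names & group)    # coverage of the target group
        outgroup = len(names - group)   # hits on non-target sequences
        if ingroup == 0 or outgroup > max_outgroup_hits:
            continue
        if ingroup > best_cov:
            best_cov, best = ingroup, [mer]
        elif ingroup == best_cov:
            best.append(mer)
    return best_cov, sorted(best)


# Toy data (hypothetical, not taken from the paper).
toy = {
    "s1": "ACGTACGA",
    "s2": "ACGTTCGA",
    "s3": "TTGTACGG",
    "s4": "GGCATTCA",
}

print(best_signatures(toy, {"s1", "s2"}, k=4))        # perfect group signature
print(best_signatures(toy, {"s1", "s2", "s3"}, k=4))  # no perfect signature exists
print(best_signatures(toy, {"s1"}, k=4))              # sequence-specific signatures
```

In this toy dataset the group {s1, s2} has a single 4-mer shared by both members and absent elsewhere, whereas {s1, s2, s3} has no 4-mer common to all three, so the sketch falls back to the candidates with the highest partial coverage. That fallback corresponds to the relaxed search the abstract describes for groups lacking a perfect common signature.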

AVAILABILITY

Software and Supplementary Material are available at http://cassis.in.tum.de/.
