Parsons J D
Genome Sequencing Center, Washington University School of Medicine, St Louis, MO 63108, USA.
Comput Appl Biosci. 1995 Dec;11(6):603-13. doi: 10.1093/bioinformatics/11.6.603.
DNA sequence clustering is an effective aid to the comprehension, summarization and compression of DNA sequence databases. Previous work produced programs suitable for comparing and clustering cDNA sequences; new, enhanced programs have now been written to cluster genomic DNA fragments, large EST projects, and entire DNA databases. Three new programs (ICAtools) are discussed: ICAass, N2tool, and ICAmatches. ICAass has been used to compress the EMBL database by hiding or removing sequences with various degrees of redundancy. It also has the fastest database querying mode. N2tool provides fast and sensitive clustering of genomic fragment databases on the basis of small areas of local similarity. N2tool has proven useful for discovering contaminating vector or other artefactual sequence when the potential contaminant is not otherwise known. ICAmatches is a new cluster analysis program that uses a novel alignment style to present multiple alignment summaries. All the tools are convenient to use because they share a common memory-frugal index format and accept most DNA sequence formats directly.
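As an illustration of clustering fragments "on the basis of small areas of local similarity", the following is a minimal sketch only, not the N2tool algorithm, whose internals the abstract does not describe. It assumes that two fragments belong together if they share at least one exact k-mer, and it merges such fragments with a simple union-find (single linkage). The function names, the value of k, and the example sequences are all hypothetical; reverse-complement matches and inexact matches are ignored.

```python
from collections import defaultdict


def kmers(seq, k):
    """Return the set of all length-k substrings of seq (upper-cased)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def cluster_by_shared_kmers(seqs, k=12):
    """Single-linkage clustering of named sequences.

    Two sequences are linked when they share at least one exact k-mer.
    This is an assumed similarity criterion chosen for illustration.
    """
    parent = {name: name for name in seqs}

    def find(x):
        # Follow parent pointers to the cluster root, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    first_seen = {}  # k-mer -> name of the first sequence that contained it
    for name, seq in seqs.items():
        for kmer in kmers(seq, k):
            if kmer in first_seen:
                # Shared k-mer: merge the two clusters.
                parent[find(name)] = find(first_seen[kmer])
            else:
                first_seen[kmer] = name

    clusters = defaultdict(list)
    for name in seqs:
        clusters[find(name)].append(name)
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical fragments: frag1 and frag2 share the 8-mer GTTTGACC.
example = {
    "frag1": "ACGTACGTTTGACCA",
    "frag2": "GGGTTTGACCATTT",
    "frag3": "CCCCCCCCAAAAAA",
}
result = cluster_by_shared_kmers(example, k=8)
print(result)
```

Under this criterion a sequence such as an unrecognized vector insert would pull every fragment containing it into one unusually large cluster, which is one plausible way that contamination could surface without the contaminant being known in advance.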