CLEANUP: a fast computer program for removing redundancies from nucleotide sequence databases.

Authors

Grillo G, Attimonelli M, Liuni S, Pesole G

Affiliation

Centro di Studio sui Mitocondri e Metabolismo Energetico, CNR, Italy.

Publication information

Comput Appl Biosci. 1996 Feb;12(1):1-8. doi: 10.1093/bioinformatics/12.1.1.

PMID:8670613

Abstract

A key issue in comparing sequence collections is redundancy. Sequence collections free from redundancy are very useful both for performing statistical analyses and for accelerating extensive database searches on nucleotide sequences. Indeed, publicly available databases contain multiple entries of identical or almost identical sequences. Performing statistical analysis on such biased data carries a high risk of assigning high significance to non-significant patterns. To carry out unbiased statistical analysis, as well as more efficient database searching, it is thus necessary to analyse sequence data that have been purged of redundancy. Given that an unambiguous definition of redundancy is impracticable for biological sequence data, the present program uses a quantitative description of redundancy based on a measure of sequence similarity. A sequence is considered redundant if its degree of similarity and overlap with a longer sequence in the database is greater than a threshold fixed by the user. In this paper we present a new algorithm based on an "approximate string matching" procedure, which determines the overall degree of similarity between each pair of sequences contained in a nucleotide sequence database and automatically generates nucleotide sequence collections free from redundancies.
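
The redundancy criterion described above can be illustrated with a minimal sketch. This is not the CLEANUP algorithm itself: the abstract does not specify its matching procedure, so Python's standard `difflib.SequenceMatcher` stands in for approximate string matching here. The function names, the way similarity and overlap are computed, and the default thresholds are all assumptions made for illustration.

```python
from difflib import SequenceMatcher


def similarity_and_overlap(short: str, long: str) -> tuple[float, float]:
    """Return (similarity, overlap) of `short` measured against `long`.

    Assumed definitions (not taken from the paper):
      overlap    = fraction of `short` spanned by the matched region
      similarity = fraction of matching characters inside that region
    """
    # autojunk=False: with a 4-letter alphabet every base is "popular",
    # so the automatic junk heuristic would distort long comparisons.
    matcher = SequenceMatcher(None, short, long, autojunk=False)
    blocks = [b for b in matcher.get_matching_blocks() if b.size > 0]
    if not blocks:
        return 0.0, 0.0
    start = blocks[0].a
    end = blocks[-1].a + blocks[-1].size
    span = end - start
    matched = sum(b.size for b in blocks)
    return matched / span, span / len(short)


def remove_redundant(
    sequences: list[str],
    min_similarity: float = 0.9,  # hypothetical user threshold
    min_overlap: float = 0.9,     # hypothetical user threshold
) -> list[str]:
    """Keep only sequences that are not redundant with a longer kept one."""
    kept: list[str] = []
    # Visit longer sequences first so each candidate is compared only
    # against sequences at least as long as itself.
    for seq in sorted(sequences, key=len, reverse=True):
        redundant = False
        for longer in kept:
            sim, ovl = similarity_and_overlap(seq, longer)
            if sim > min_similarity and ovl > min_overlap:
                redundant = True
                break
        if not redundant:
            kept.append(seq)
    return kept
```

The sketch compares every candidate with every retained sequence, so its cost grows quadratically with the size of the collection; it is meant only to make the threshold rule concrete, not to reproduce the speed the program is designed for.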
