WindowMasker: window-based masker for sequenced genomes.
Author information
Morgulis Aleksandr, Gertz E Michael, Schäffer Alejandro A, Agarwala Richa
Affiliation
National Center for Biotechnology Information, National Institutes of Health, Department of Health and Human Services, Building 38A, Room 1003N, 8600 Rockville Pike, Bethesda, MD 20894, USA.
Publication information
Bioinformatics. 2006 Jan 15;22(2):134-41. doi: 10.1093/bioinformatics/bti774. Epub 2005 Nov 15.
MOTIVATION
Matches to repetitive sequences are usually undesirable in the output of DNA database searches. Repetitive sequences need not be matched to a query, if they can be masked in the database. RepeatMasker/Maskeraid (RM), currently the most widely used software for DNA sequence masking, is slow and requires a library of repetitive template sequences, such as a manually curated RepBase library, that may not exist for newly sequenced genomes.
RESULTS
We have developed a software tool called WindowMasker (WM) that identifies and masks highly repetitive DNA sequences in a genome, using only the sequence of the genome itself. WM is orders of magnitude faster than RM because WM uses a few linear-time scans of the genome sequence, rather than local alignment methods that compare each library sequence with each piece of the genome. We validate WM by comparing BLAST outputs from large sets of queries applied to two versions of the same genome, one masked by WM, and the other masked by RM. Even for genomes such as the human genome, where a good RepBase library is available, searching the database as masked with WM yields more matches that are apparently non-repetitive and fewer matches to repetitive sequences. We show that these results hold for transcribed regions as well. WM also performs well on genomes for which much of the sequence was in draft form at the time of the analysis.
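The idea of masking with a few linear-time scans of the genome itself can be illustrated with a minimal sketch. This is not the scoring scheme or the thresholds that WM actually uses; the function names (`count_kmers`, `mask_repeats`) and the parameters (`k`, `window`, `threshold`) are hypothetical and chosen only for illustration. The first pass counts every k-mer, and the second pass slides a window over the sequence and soft-masks (lowercases) any window whose mean k-mer count exceeds the threshold.

```python
from collections import Counter


def count_kmers(seq: str, k: int) -> Counter:
    """First pass: count every overlapping k-mer in the sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def mask_repeats(seq: str, k: int = 4, window: int = 3,
                 threshold: float = 3.0) -> str:
    """Soft-mask windows whose mean k-mer count exceeds `threshold`.

    `window` is the number of consecutive k-mers scored together.
    All parameter values are illustrative assumptions, not WM defaults.
    """
    seq = seq.upper()
    counts = count_kmers(seq, k)
    n_kmers = len(seq) - k + 1
    if n_kmers < window:
        return seq  # too short to score a single window

    # Genome-wide count of the k-mer starting at each position.
    freqs = [counts[seq[i:i + k]] for i in range(n_kmers)]

    masked = [False] * len(seq)
    masked_until = 0              # bases before this index are already decided
    running = sum(freqs[:window])  # sum of counts in the current window

    # Second pass: slide the window, updating the running sum in O(1).
    for start in range(n_kmers - window + 1):
        if start > 0:
            running += freqs[start + window - 1] - freqs[start - 1]
        if running / window > threshold:
            end = start + window + k - 1  # exclusive end of covered bases
            for j in range(max(start, masked_until), end):
                masked[j] = True
            masked_until = end

    return "".join(c.lower() if m else c for c, m in zip(seq, masked))
```

For example, `mask_repeats("GATTACA" + "AC" * 10 + "TGCATCG")` lowercases the dinucleotide repeat while leaving the unique flanks at both ends in upper case. Because each base is marked at most once and the window sum is updated incrementally, both passes are linear in the sequence length, with no template library and no alignment step.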
AVAILABILITY
WM is included in the NCBI C++ toolkit. The source code for the entire toolkit is available at ftp://ftp.ncbi.nih.gov/toolbox/ncbi_tools++/CURRENT/. Once the toolkit source is unpacked, the instructions for building the WindowMasker application in a UNIX environment can be found in the file src/app/winmasker/README.build.
SUPPLEMENTARY INFORMATION
Supplementary data are available at ftp://ftp.ncbi.nlm.nih.gov/pub/agarwala/windowmasker/windowmasker_suppl.pdf