Malde Ketil, Schneeberger Korbinian, Coward Eivind, Jonassen Inge
Computational Biology Unit, Bergen Centre for Computational Sciences, University of Bergen, Norway.
Bioinformatics. 2006 Sep 15;22(18):2232-6. doi: 10.1093/bioinformatics/btl368. Epub 2006 Jul 12.
Repeat sequences in ESTs are a source of problems, in particular for clustering. ESTs are therefore commonly masked against a library of known repeats. High-quality repeat libraries are available for widely studied organisms, but for most other organisms the lack of such libraries is likely to compromise the quality of EST analysis.
We present RBR, a fast, flexible and library-less method for masking repeats in EST sequences, based on match statistics within the EST collection. The method is not linked to a particular clustering algorithm. Extensive testing on datasets, using different clustering methods and a genomic mapping as reference, shows that this method gives results that are better than or as good as those obtained using RepeatMasker with a repeat library.
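To make the idea of library-less masking concrete, the following is a minimal sketch and not the RBR algorithm itself, whose match statistics are not specified in this abstract. It assumes a simplified stand-in: count in how many sequences of the collection each k-mer occurs, and soft-mask (lowercase) any k-mer seen in more than a threshold number of sequences. The function names and the parameters `k` and `max_count` are hypothetical and chosen only for illustration.

```python
from collections import Counter


def count_kmers(seqs, k):
    """Count, for each k-mer, the number of sequences that contain it."""
    counts = Counter()
    for s in seqs:
        # A set ensures each k-mer is counted at most once per sequence.
        seen = {s[i:i + k] for i in range(len(s) - k + 1)}
        counts.update(seen)
    return counts


def mask_repeats(seqs, k=4, max_count=3):
    """Soft-mask (lowercase) k-mers occurring in more than max_count sequences.

    Illustrative assumption: a k-mer shared by many sequences in the
    collection is treated as a likely repeat. No external library is used.
    """
    counts = count_kmers(seqs, k)
    masked = []
    for s in seqs:
        flags = [False] * len(s)
        for i in range(len(s) - k + 1):
            if counts[s[i:i + k]] > max_count:
                # Flag every base covered by the over-represented k-mer.
                for j in range(i, i + k):
                    flags[j] = True
        masked.append("".join(c.lower() if f else c for c, f in zip(s, flags)))
    return masked


if __name__ == "__main__":
    ests = ["AAAACCCCGT", "TTGACCCCAG", "GGTACCCCTA", "CATGCATGCA"]
    for original, result in zip(ests, mask_repeats(ests, k=4, max_count=2)):
        print(original, "->", result)
```

In this toy input, the segment shared by the first three sequences is lowercased while the fourth sequence is left untouched. Realistic use would need much longer k-mers and a threshold derived from the expected coverage of the EST collection, since highly expressed genes also produce many matching ESTs.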
The implementation of RBR is available under the terms of the GPL from http://www.ii.uib.no/~ketil/bioinformatics.
Supplementary data are available at Bioinformatics online.