Department of Biology, The Pennsylvania State University, State College, PA 16802, USA.
Center for Medical Genomics, The Pennsylvania State University, State College, PA 16802, USA.
Bioinformatics. 2019 Nov 1;35(22):4809-4811. doi: 10.1093/bioinformatics/btz484.
Tandem DNA repeats can be sequenced with long-read technologies, but cannot be accurately deciphered due to the lack of computational tools taking high error rates of these technologies into account. Here we introduce Noise-Cancelling Repeat Finder (NCRF) to uncover putative tandem repeats of specified motifs in noisy long reads produced by Pacific Biosciences and Oxford Nanopore sequencers. Using simulations, we validated the use of NCRF to locate tandem repeats with motifs of various lengths and demonstrated its superior performance as compared to two alternative tools. Using real human whole-genome sequencing data, NCRF identified long arrays of the (AATGG)n repeat involved in heat shock stress response.
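As a rough illustration of what locating a specified motif in a noisy read involves, the following minimal Python sketch scores how far a read segment diverges from a perfect tandem repeat of that motif. It is not NCRF's algorithm: the function names (`edit_distance`, `perfect_repeat`, `repeat_divergence`) are hypothetical, and comparing against a perfect repeat of the same length as the segment is a simplifying assumption.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion from a
                            curr[j - 1] + 1,            # insertion into a
                            prev[j - 1] + (ca != cb)))  # match or substitution
        prev = curr
    return prev[-1]


def perfect_repeat(motif: str, length: int, offset: int = 0) -> str:
    """Error-free tandem repeat of `motif`, starting `offset` bases into the motif."""
    rotated = motif[offset:] + motif[:offset]
    copies = length // len(motif) + 1
    return (rotated * copies)[:length]


def repeat_divergence(segment: str, motif: str) -> float:
    """Smallest edit distance to a perfect repeat over all motif phases,
    divided by the segment length (0.0 means an error-free repeat)."""
    best = min(edit_distance(segment, perfect_repeat(motif, len(segment), k))
               for k in range(len(motif)))
    return best / len(segment)


if __name__ == "__main__":
    clean = "AATGG" * 4
    noisy = "AATGGAATGGAACGGAATGG"  # one substitution in the third copy
    print(repeat_divergence(clean, "AATGG"))
    print(repeat_divergence(noisy, "AATGG"))
```

A segment whose divergence falls below a chosen threshold could be treated as a putative repeat; a real tool must also decide where such segments start and end within a read, which this sketch does not attempt.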
NCRF is implemented in C, supported by several Python scripts, and is available through Bioconda and at https://github.com/makovalab-psu/NoiseCancellingRepeatFinder.
Supplementary data are available at Bioinformatics online.