Hybrid de novo tandem repeat detection using short and long reads.

Author information

Fertin Guillaume, Jean Géraldine, Radulescu Andreea, Rusu Irena

Publication information

BMC Med Genomics. 2015;8 Suppl 3(Suppl 3):S5. doi: 10.1186/1755-8794-8-S3-S5. Epub 2015 Sep 23.

Abstract

BACKGROUND

As one of the most studied genome rearrangements, tandem repeats have a considerable impact on the genetic background of inherited diseases. Many methods designed for tandem repeat detection on reference sequences obtain high-quality results. However, in a de novo context, where no reference sequence is available, tandem repeat detection remains a difficult problem. The short reads obtained with second-generation sequencing methods are not long enough to span regions that contain long repeats. This length limitation was addressed by the long reads obtained with third-generation sequencing platforms such as Pacific Biosciences technologies. Nevertheless, the gain in read length came with a significant increase in the error rate. The main objective of current studies on long reads is to handle this high error rate, which can reach 16%.

METHODS

In this paper, we present MixTaR, the first de novo method for tandem repeat detection that combines the high quality of short reads with the greater length of long reads. Our hybrid algorithm uses the set of short reads to detect tandem repeat patterns based on a de Bruijn graph. These patterns are then validated using the long reads, and the tandem repeat sequences are constructed using local greedy assemblies.
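The two stages described above can be illustrated with a minimal Python sketch. It is not MixTaR's implementation: the function names, the cycle enumeration strategy, the parameters `k`, `max_len`, `min_copies` and `max_mismatch`, and the substitution-only validation are all assumptions made for illustration. The idea it shows is that a tandem repeat in the short reads produces a cycle in the de Bruijn graph, and that a candidate pattern can then be checked against a noisy long read with some tolerance for errors. The local greedy assembly step is omitted.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in some read."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def canonical_rotation(pattern):
    """Rotations of a pattern describe the same repeat; keep the smallest one."""
    return min(pattern[i:] + pattern[:i] for i in range(len(pattern)))


def candidate_patterns(graph, max_len):
    """Enumerate simple cycles of at most max_len nodes (hypothetical strategy).

    Each cycle spells a candidate pattern: the last base of every node on it.
    """
    patterns = set()

    def dfs(start, node, path):
        for nxt in graph.get(node, ()):
            if nxt == start:
                spelled = "".join(n[-1] for n in path)
                patterns.add(canonical_rotation(spelled))
            elif nxt not in path and len(path) < max_len:
                dfs(start, nxt, path + [nxt])

    for start in list(graph):
        dfs(start, start, [start])
    return patterns


def validated_by_long_read(pattern, long_read, min_copies=3, max_mismatch=0.2):
    """Accept a pattern if some window of the long read matches min_copies
    consecutive copies with at most max_mismatch substituted bases (fraction).

    Simplification: only substitutions are tolerated here; a real tool would
    also need to tolerate insertions and deletions.
    """
    target = pattern * min_copies
    n = len(target)
    for i in range(len(long_read) - n + 1):
        window = long_read[i:i + n]
        mismatches = sum(a != b for a, b in zip(window, target))
        if mismatches <= max_mismatch * n:
            return True
    return False


# Toy data: four copies of "ACG" between unique flanks, cut into 8-base reads.
genome = "GGTCA" + "ACG" * 4 + "TTAGC"
short_reads = [genome[i:i + 8] for i in range(len(genome) - 8 + 1)]
graph = build_de_bruijn(short_reads, k=4)
patterns = candidate_patterns(graph, max_len=10)
print(patterns)  # → {'ACG'}

# A long read carrying the same repeat with one substituted base.
noisy_long_read = "TTTT" + "ACGACGACTACGACG" + "GGGG"
print([p for p in patterns if validated_by_long_read(p, noisy_long_read)])
```

Exhaustive cycle enumeration is only practical on small graphs such as this toy example; it is used here because it keeps the link between repeats and graph cycles easy to see.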

RESULTS

MixTaR is tested with both simulated and real reads from complex organisms. For a complete analysis of its robustness to errors, we use short and long reads with different error rates. The results are then analysed in terms of the number of tandem repeats detected and the length of their patterns.

CONCLUSIONS

Our method shows high precision and sensitivity. With low false positive rates even for highly erroneous reads, MixTaR accurately detects tandem repeats whose pattern lengths vary over a wide range.
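For readers unfamiliar with the two metrics, the following Python sketch shows how precision and sensitivity could be computed for a set of detected patterns. The matching rule (two patterns count as the same repeat when one is a rotation of the other) and the function names are assumptions for illustration; the abstract does not state how detections were matched against the expected repeats.

```python
def canonical_rotation(pattern):
    """Rotations of a pattern describe the same repeat; keep the smallest one."""
    return min(pattern[i:] + pattern[:i] for i in range(len(pattern)))


def precision_sensitivity(detected, expected):
    """Return (precision, sensitivity) for detected vs. expected patterns.

    precision   = true positives / all detected patterns
    sensitivity = true positives / all expected patterns
    Matching by rotation is an assumption of this sketch.
    """
    det = {canonical_rotation(p) for p in detected}
    exp = {canonical_rotation(p) for p in expected}
    true_positives = len(det & exp)
    precision = true_positives / len(det) if det else 0.0
    sensitivity = true_positives / len(exp) if exp else 0.0
    return precision, sensitivity


# Two of three detected patterns match two of three expected ones.
precision, sensitivity = precision_sensitivity(
    detected=["GAC", "AT", "TTG"],
    expected=["ACG", "TA", "CCGG"],
)
```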


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/982f/4582210/fb80cbd75ed0/1755-8794-8-S3-S5-1.jpg
