New York Genome Center, New York, NY, USA.
Cold Spring Harbor Laboratory, Simons Center for Quantitative Biology, Cold Spring Harbor, New York, NY, USA.
Front Bioeng Biotechnol. 2015 Jan 26;3:8. doi: 10.3389/fbioe.2015.00008. eCollection 2015.
Repetitive sequences are abundant in the human genome. Different classes of repetitive DNA sequences, including simple repeats, tandem repeats, segmental duplications, interspersed repeats, and other elements, collectively span more than 50% of the genome. Because repeat sequences occur in the genome at different scales, they can cause various types of sequence analysis errors, including errors in alignment, de novo assembly, and annotation. This mini-review highlights the challenges that small-scale repeat sequences, especially near-identical tandem or closely located repeats and short tandem repeats, pose for discovering DNA insertion and deletion (indel) mutations from next-generation sequencing data. We also discuss the de Bruijn graph sequence assembly paradigm, which is emerging as the most popular and promising approach for detecting indels. The human exome is used as an example to highlight how these repetitive elements can obscure true indels or introduce errors when detecting them.
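To illustrate why small-scale repeats complicate de Bruijn graph assembly, the following is a minimal sketch and not the method of any particular indel caller. It builds a de Bruijn graph from a single sequence and checks whether the graph contains a cycle. The function names (`build_de_bruijn`, `has_cycle`), the toy sequences, and the choice of k = 4 are all assumptions made for illustration. When a short tandem repeat is longer than k, its k-mers collapse onto the same nodes and form a cycle, so the number of repeat copies can no longer be read from the graph topology alone.

```python
from collections import defaultdict


def build_de_bruijn(seq, k):
    """Build a de Bruijn graph from one sequence (illustrative helper).

    Nodes are (k-1)-mers; each k-mer adds an edge prefix -> suffix.
    Returns {prefix: {suffix: number of times that k-mer occurs}}.
    """
    graph = defaultdict(lambda: defaultdict(int))
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        graph[kmer[:-1]][kmer[1:]] += 1
    # Convert to plain dicts so later lookups cannot create new nodes.
    return {u: dict(vs) for u, vs in graph.items()}


def has_cycle(graph):
    """Return True if the directed graph has a cycle (iterative DFS)."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color = defaultdict(int)  # every node starts WHITE (unvisited)
    for start in graph:
        if color[start] != WHITE:
            continue
        color[start] = GRAY
        stack = [(start, iter(graph.get(start, {})))]
        while stack:
            node, successors = stack[-1]
            for nxt in successors:
                if color[nxt] == GRAY:
                    return True  # back edge: nxt is on the current path
                if color[nxt] == WHITE:
                    color[nxt] = GRAY
                    stack.append((nxt, iter(graph.get(nxt, {}))))
                    break
            else:
                # All successors explored; node is finished.
                color[node] = BLACK
                stack.pop()
    return False


# Toy sequences (assumed for illustration, not taken from real data).
unique_seq = "ACGTTGCA"                    # no repeated 3-mers
repeat_seq = "TTGA" + "CAG" * 4 + "TCCA"   # contains a (CAG)x4 tandem repeat

k = 4
unique_graph = build_de_bruijn(unique_seq, k)
repeat_graph = build_de_bruijn(repeat_seq, k)

print("unique sequence has cycle:", has_cycle(unique_graph))
print("repeat sequence has cycle:", has_cycle(repeat_graph))
print("multiplicity of edge CAG -> AGC:", repeat_graph["CAG"]["AGC"])
```

In this sketch the repeat copy number survives only as an edge multiplicity, which in real sequencing data is confounded by coverage variation and sequencing errors. Choosing a k longer than the repeat removes the cycle but requires reads that span the entire repeat.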