Department of Ecology & Evolutionary Biology, Yale University, New Haven, Connecticut 06520, USA.
BMC Genomics. 2013 Aug 8;14:537. doi: 10.1186/1471-2164-14-537.
The numerous classes of repeats often impede the assembly of genome sequences from the short reads provided by new sequencing technologies. We demonstrate a simple and rapid means to ascertain the repeat structure and total size of a bacterial or archaeal genome without the need for assembly by directly analyzing the abundances of distinct k-mers among reads.
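As an illustration of the general idea, the sketch below estimates genome size from k-mer abundances alone. It is a minimal sketch under stated assumptions, not the authors' procedure: the function names, the `min_abundance` cutoff for discarding likely sequencing errors, and the estimator (total k-mer occurrences divided by the modal k-mer abundance) are assumptions introduced here. For simplicity it also does not merge a k-mer with its reverse complement.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer made only of A, C, G, T across all reads."""
    counts = Counter()
    valid = set("ACGT")
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if set(kmer) <= valid:  # skip k-mers containing N or other symbols
                counts[kmer] += 1
    return counts


def abundance_histogram(counts):
    """Map each abundance value to the number of distinct k-mers having it."""
    return Counter(counts.values())


def estimate_genome_size(counts, min_abundance=2):
    """Estimate genome size as total k-mer occurrences / modal abundance.

    Assumption: k-mers seen fewer than `min_abundance` times are treated
    as sequencing errors and ignored. The modal abundance is taken as the
    depth of single-copy sequence.
    """
    hist = abundance_histogram(counts)
    kept = {a: n for a, n in hist.items() if a >= min_abundance}
    if not kept:
        raise ValueError("no k-mers at or above min_abundance")
    peak = max(kept, key=lambda a: kept[a])   # depth of single-copy k-mers
    total = sum(a * n for a, n in kept.items())
    return total / peak, peak


def copy_number_profile(counts, peak):
    """Approximate copy number per k-mer as abundance / single-copy depth.

    Returns a histogram: rounded copy number -> number of distinct k-mers.
    Values above 1 indicate repeated sequence.
    """
    profile = Counter()
    for abundance in counts.values():
        copies = round(abundance / peak)
        if copies >= 1:
            profile[copies] += 1
    return profile
```

Calling `estimate_genome_size(count_kmers(reads, k))` returns the size estimate together with the single-copy depth, which `copy_number_profile` then uses to summarize how many distinct k-mers occur at two, three, or more copies. Real read sets would normally be counted with a dedicated k-mer counter rather than an in-memory `Counter`.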
We demonstrate the sensitivity of this procedure in resolving variation within a bacterial species: genome sizes and repeat structure were estimated for five environmental strains of E. coli from short Illumina reads, and the total genome sizes corresponded well with those obtained for the same strains by pulsed-field gel electrophoresis. In addition, applying the approach to read sets for completed genomes showed it to be accurate over a wide range of microbial genome sizes.
Application of these procedures, based solely on k-mer abundances in short read data sets, allows aspects of genome structure to be resolved that are not apparent from conventional short read assemblies. This knowledge of the repetitive content of genomes provides insights into genome evolution and diversity.