Gallo Juan Esteban, Muñoz José Fernando, Misas Elizabeth, McEwen Juan Guillermo, Clay Oliver Keatinge
Cellular & Molecular Biology Unit, Corporación para Investigaciones Biológicas, Medellín, Colombia; Doctoral Program in Biomedical Sciences, Universidad del Rosario, Bogotá, Colombia.
Cellular & Molecular Biology Unit, Corporación para Investigaciones Biológicas, Medellín, Colombia; Institute of Biology, Universidad de Antioquia, Medellín, Colombia.
Comput Biol Chem. 2014 Dec;53 Pt A:97-107. doi: 10.1016/j.compbiolchem.2014.08.014. Epub 2014 Aug 29.
Selecting the values of parameters used by de novo genomic assembly programs, or choosing an optimal de novo assembly from several runs obtained with different parameters or programs, are tasks that can require complex decision-making. A key parameter that must be supplied to typical next-generation sequencing (NGS) assemblers is the k-mer length, i.e., the word size that determines which de Bruijn graph the program should map out and use. The topic of assembly selection criteria was recently revisited in the Assemblathon 2 study (Bradnam et al., 2013). Although no clear message was delivered with regard to optimal k-mer lengths, it was shown with examples that it is sometimes important to decide whether one is most interested in optimizing the sequences of protein-coding genes (the gene space) or in optimizing the whole genome sequence including the intergenic DNA, as what is best for one criterion may not be best for the other. In the present study, our aim was to better understand how the assembly of unicellular fungi (which are typically intermediate in size and complexity between prokaryotes and metazoan eukaryotes) can change as one varies the k-mer values over a wide range. We used two different de novo assembly programs (SOAPdenovo2 and ABySS), and simple assembly metrics that also focused on success in assembling the gene space and repetitive elements. A recent increase in Illumina read length to around 150 bp allowed us to attempt de novo assemblies with a larger range of k-mers, up to 127 bp. We applied these methods to Illumina paired-end sequencing read sets of fungal strains of Paracoccidioides brasiliensis and other species.
By visualizing the results in simple plots, we were able to track the effect of changing k-mer size and assembly program, and to demonstrate how such plots can readily reveal discontinuities or other unexpected characteristics that assembly programs can present in practice, especially when they are used in a traditional molecular microbiology laboratory with a 'genomics corner'. Here we propose and apply a component of a first-pass validation methodology for benchmarking and understanding fungal genome de novo assembly processes.
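As a rough illustration of why the k-mer length determines which de Bruijn graph an assembler works with, the following minimal Python sketch builds a toy graph from a few short reads and summarizes it for several values of k. It is not the method of SOAPdenovo2 or ABySS, which additionally handle reverse complements, sequencing errors, coverage and scaffolding. The reads and the function names `de_bruijn_graph` and `graph_summary` are invented for this example.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Build a toy de Bruijn graph: nodes are (k-1)-mers, edges are k-mers.

    Assumption for illustration only: reads are error-free and only the
    forward strand is considered.
    """
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Each k-mer links its (k-1)-prefix to its (k-1)-suffix.
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def graph_summary(graph):
    """Count nodes, distinct edges, and nodes with more than one successor."""
    nodes = set(graph)
    for successors in graph.values():
        nodes.update(successors)
    edges = sum(len(successors) for successors in graph.values())
    branching = sum(1 for successors in graph.values() if len(successors) > 1)
    return {"nodes": len(nodes), "edges": edges, "branching_nodes": branching}


# Hypothetical toy reads sampled from a short sequence with a repeated "ACGT".
reads = ["ACGTACGTGA", "CGTACGTGAC", "GTACGTGACC"]

for k in (3, 5, 7):
    print(k, graph_summary(de_bruijn_graph(reads, k)))
```

In this toy case the number of branching nodes falls as k grows, because a longer word spans the short repeat. With real reads, larger k also reduces the number of k-mers each read contributes, which is one reason a sweep over a range of k values can be informative.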