Kucuk Erdi, Chu Justin, Vandervalk Benjamin P, Hammond S Austin, Warren René L, Birol Inanc
University of British Columbia, Vancouver, BC, Canada.
Canada's Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC, Canada.
Bioinformatics. 2017 Jun 15;33(12):1782-1788. doi: 10.1093/bioinformatics/btx078.
Despite considerable advancements in sequencing and computing technologies, de novo assembly of whole eukaryotic genomes is still a time-consuming task that requires significant computational resources and expertise. A targeted approach that performs local assembly of sequences of interest remains a valuable option for some applications. This is especially true for gene-centric assemblies, whose resulting sequences can be readily used in more focused biological research. Here we describe Kollector, an alignment-free targeted assembly pipeline that uses thousands of transcript sequences concurrently to inform the localized assembly of the corresponding gene loci. Kollector robustly reconstructs introns and novel sequences within these loci, and scales well to large genomes, properties that make it especially useful for researchers working on non-model eukaryotic organisms.
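The abstract does not spell out how the alignment-free step works. As a rough illustration of the general idea only, the sketch below recruits reads that share k-mers with a set of transcripts, so that a local assembler could then work on the recruited subset. The function names (`kmers`, `recruit_reads`), the parameters `k` and `min_shared`, and the use of an exact Python set are all assumptions made for this example and are not taken from Kollector; the sketch also ignores reverse complements and sequencing errors.

```python
from typing import Iterable, List, Set


def kmers(seq: str, k: int) -> Set[str]:
    """Return the set of all k-length substrings of seq (upper-cased)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def recruit_reads(transcripts: Iterable[str], reads: Iterable[str],
                  k: int = 5, min_shared: int = 2) -> List[str]:
    """Keep reads sharing at least min_shared k-mers with any transcript.

    Hypothetical helper: no alignment is computed, only k-mer membership.
    """
    seed: Set[str] = set()
    for transcript in transcripts:
        seed |= kmers(transcript, k)
    return [r for r in reads if len(kmers(r, k) & seed) >= min_shared]


if __name__ == "__main__":
    transcripts = ["ACGTACGTGACCTGA"]
    reads = ["TACGTGACC", "GGGGGGGGG", "CCTGATTTT"]
    # The first read shares five 5-mers with the transcript, the second
    # none, and the third only one, so only the first passes min_shared=2.
    print(recruit_reads(transcripts, reads))
```

Requiring more than one shared k-mer is a simple guard against recruiting reads on a chance match; a practical tool working at genome scale would more likely use a memory-efficient structure in place of the exact set used here.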
We demonstrate the performance of Kollector for assembling complete or near-complete Caenorhabditis elegans and Homo sapiens gene loci from their respective input transcripts. Using whole genome shotgun sequencing reads, the Kollector pipeline reconstructs 99% of C. elegans and 80% of H. sapiens transcript targets in their corresponding genomic space in a time- and memory-efficient manner, compared to 86% and 73%, respectively, with standard de novo assembly techniques. We also show that Kollector outperforms both established and recently released targeted assembly tools. Finally, we demonstrate three use cases for Kollector, including comparative and cancer genomics applications.
Kollector is implemented as a bash script, and is available at https://github.com/bcgsc/kollector.
Supplementary data are available at Bioinformatics online.