Department of Ecophysiology and Aquaculture, Leibniz-Institute of Freshwater Ecology and Inland Fisheries (IGB), Müggelseedamm 310, 12587 Berlin, Germany.
College of Fisheries, Chinese Perch Research Center, Huazhong Agricultural University; Innovation Base for Chinese Perch Breeding, Key Lab of Freshwater Animal Breeding, Ministry of Agriculture, No.1 Shizishan Street, Hongshan District, 430070 Wuhan, Hubei Province, P.R. China.
Gigascience. 2020 May 1;9(5). doi: 10.1093/gigascience/giaa034.
Easy-to-use and fast bioinformatics pipelines for long-read assembly that go beyond the contig level to generate highly continuous chromosome-scale genomes from raw data remain scarce.
Chromosome-Scale Assembler (CSA) is a novel, computationally highly efficient bioinformatics pipeline that fills this gap. CSA integrates information from scaffolded assemblies (e.g., Hi-C or 10X Genomics) or even from diverged reference genomes into the assembly process. As CSA performs automated assembly of chromosome-sized scaffolds, we benchmark its performance against state-of-the-art reference genomes, i.e., genomes built conventionally and laboriously using multiple separate assembly tools and manual curation. CSA increases contig lengths through scaffolding, local re-assembly, and gap closing. On certain datasets, the initial contig N50 increases by up to 4.5-fold. For smaller vertebrate genomes, chromosome-scale assemblies can be achieved within 12 h using low-cost, high-end desktop computers. Mammalian genomes can be processed within 16 h on compute servers. Using diverged reference genomes for fish, birds, and mammals, we demonstrate that CSA calculates chromosome-scale assemblies from long-read data and genome comparisons alone. Even contig-level draft assemblies of diverged genomes are helpful for reconstructing chromosome-scale sequences. CSA is also capable of assembling ultra-long reads.
CSA can speed up and simplify chromosome-level assembly and significantly lower the costs of large-scale family-level vertebrate genome projects.
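The contig N50 reported above is the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the total assembly length. The following is a minimal sketch of that standard definition, not code taken from CSA; the function name `n50` and the contig lengths in the usage lines are hypothetical values chosen only for illustration.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    Generic illustration of the standard N50 definition; it is not
    part of the CSA pipeline.
    """
    lengths = list(lengths)
    if not lengths:
        raise ValueError("at least one sequence length is required")

    total = sum(lengths)
    running = 0
    # Walk from the longest sequence to the shortest and stop at the
    # first one where the cumulative length reaches half of the total.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Hypothetical contig lengths (in Mb) before and after scaffolding and
# gap closing; these numbers are invented for the example.
before = [2, 3, 4, 5, 6]
after = [9, 11]

n50_before = n50(before)
n50_after = n50(after)
print(n50_before, n50_after)
```

Dividing the N50 after processing by the N50 before gives the fold increase of the kind quoted in the abstract.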