Graduate Program in Bioinformatics and Systems Biology, University of California San Diego, La Jolla, CA, USA.
Department of Computer Science and Engineering, University of California San Diego, La Jolla, CA, USA.
Nat Biotechnol. 2020 Nov;38(11):1309-1316. doi: 10.1038/s41587-020-0582-4. Epub 2020 Jul 14.
Centromeric variation has been linked to cancer and infertility, but centromere sequences contain multiple tandem repeats and can only be assembled manually from long error-prone reads. Here we describe the centroFlye algorithm for centromere assembly using long error-prone reads, and apply it to assemble human centromeres on chromosomes 6 and X. Our analyses reveal putative breakpoints in the manual reconstruction of the human X centromere, demonstrate that the human X chromosome is partitioned into repeat subfamilies and provide initial insights into centromere evolution. We anticipate that centroFlye could be applied to automatically close the remaining multimegabase gaps in the reference human genome.