

Pushing the limits of HiFi assemblies reveals centromere diversity between two Arabidopsis thaliana genomes.

Author information

Department of Molecular Biology, Max Planck Institute for Biology Tübingen, 72076 Tübingen, Germany.

Genomics Technologies, Corteva Agriscience, Johnston, IA 50131, USA.

Publication information

Nucleic Acids Res. 2022 Nov 28;50(21):12309-12327. doi: 10.1093/nar/gkac1115.

Abstract

Although long-read sequencing can often enable chromosome-level reconstruction of genomes, it is still unclear how one can routinely obtain gapless assemblies. In the model plant Arabidopsis thaliana, all accessions other than the reference accession Col-0 that have been de novo assembled with long reads until now have used PacBio continuous long reads (CLR). Although these assemblies sometimes achieved chromosome-arm level contigs, they inevitably broke near the centromeres, excluding megabases of DNA from analysis in pan-genome projects. Since PacBio high-fidelity (HiFi) reads circumvent the high error rate of CLR technologies, albeit at the expense of read length, we compared a CLR assembly of accession Eyach15-2 to HiFi assemblies of the same sample. The use of five different assemblers starting from subsampled data allowed us to evaluate the impact of coverage and read length. We found that centromeres and rDNA clusters are responsible for 71% of contig breaks in the CLR scaffolds, while relatively short stretches of GA/TC repeats are at the core of >85% of the unfilled gaps in our best HiFi assemblies. Since the HiFi technology consistently enabled us to reconstruct gapless centromeres and 5S rDNA clusters, we demonstrate the value of the approach by comparing these previously inaccessible regions of the genome between the Eyach15-2 accession and the reference accession Col-0.


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/6cf6/9757041/b264478ac45f/gkac1115fig1.jpg
