Highly accurate long reads are crucial for realizing the potential of biodiversity genomics.

Affiliations

Department of Watershed Sciences, Utah State University, Logan, UT, USA.

DNA Sequencing Center, Department of Biology, Brigham Young University, Provo, UT, USA.

Publication information

BMC Genomics. 2023 Mar 16;24(1):117. doi: 10.1186/s12864-023-09193-9.

Abstract

BACKGROUND

Generating the most contiguous, accurate genome assemblies given available sequencing technologies is a long-standing challenge in genome science. With the rise of long-read sequencing, assembly challenges have shifted from merely increasing contiguity to correctly assembling complex, repetitive regions of interest, ideally in a phased manner. At present, researchers largely choose between two types of long-read data: longer but less accurate reads, or highly accurate (i.e., >Q20, or >99% accurate) but shorter reads. To better understand how these types of long-read data, as well as the scale of the data (i.e., mean read length and sequencing depth), influence genome assembly outcomes, we compared genome assemblies for a caddisfly, Hesperophylax magnus, generated with longer but less accurate Oxford Nanopore (ONT) R9.4.1 data and with highly accurate PacBio HiFi (HiFi) data. Next, we expanded this comparison to consider the influence of highly accurate long-read sequence data on genome assemblies across 6750 plant and animal genomes. For this broader comparison, we used HiFi data as a surrogate for highly accurate long reads in general, because their use can be identified from GenBank metadata.
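The "Q20 or 99% accurate" threshold above follows from the standard Phred scale, where Q = -10 · log10(error probability). As a minimal sketch of that conversion (the function names are illustrative and do not come from the paper):

```python
import math


def phred_to_accuracy(q: float) -> float:
    """Per-base accuracy implied by a Phred quality score q."""
    # Phred definition: error probability p = 10 ** (-q / 10)
    return 1.0 - 10.0 ** (-q / 10.0)


def accuracy_to_phred(accuracy: float) -> float:
    """Phred quality score implied by a per-base accuracy in [0, 1)."""
    error = 1.0 - accuracy
    if error <= 0.0:
        raise ValueError("accuracy must be below 1.0 (error must be positive)")
    return -10.0 * math.log10(error)


if __name__ == "__main__":
    for q in (10, 20, 30):
        print(f"Q{q}: accuracy = {phred_to_accuracy(q):.4f}")
```

On this scale, each additional 10 Q points corresponds to a tenfold lower error rate, so Q20 is one expected error per 100 bases.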

RESULTS

HiFi reads outperformed ONT reads in all assembly metrics tested for the caddisfly data set and allowed for accurate assembly of the repetitive ~20 kb H-fibroin gene. Across plants and animals, genome assemblies that incorporated HiFi reads were also more contiguous. For plants, the average HiFi assembly was 501% more contiguous (mean contig N50 = 20.5 Mb) than those generated with any other long-read data (mean contig N50 = 4.1 Mb). For animals, HiFi assemblies were 226% more contiguous (mean contig N50 = 20.9 Mb) than other long-read assemblies (mean contig N50 = 9.3 Mb). In plants, we also found limited evidence that HiFi may offer a unique solution for overcoming genomic complexity that scales with assembly size.
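Contiguity here is summarized with contig N50: the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the assembly. The sketch below computes N50 from a list of contig lengths and the ratio between two mean N50 values. It is an illustration, not the authors' pipeline, and the function names are hypothetical. Using the rounded means quoted above, the ratios come out near 5.0 for plants and 2.25 for animals, which suggests the reported percentages track the ratio of means rather than a percentage increase. That reading is an inference from the rounded numbers, not something the abstract states.

```python
from typing import Iterable


def contig_n50(lengths: Iterable[int]) -> int:
    """Return the N50 of a collection of contig lengths (in bases)."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("at least one contig length is required")
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        # Stop at the first contig where the cumulative length
        # reaches half of the total assembly length.
        if 2 * running >= total:
            return length
    raise AssertionError("unreachable for non-empty input")


def fold_change(n50_a: float, n50_b: float) -> float:
    """Ratio of two N50 values (how many times larger a is than b)."""
    if n50_b <= 0:
        raise ValueError("denominator N50 must be positive")
    return n50_a / n50_b


if __name__ == "__main__":
    # Toy contig lengths, not data from the study.
    print(contig_n50([2, 3, 4, 5, 6, 7, 8, 9, 10]))
    # Rounded mean contig N50 values (Mb) quoted in the abstract.
    print(round(fold_change(20.5, 4.1), 2))  # plants
    print(round(fold_change(20.9, 9.3), 2))  # animals
```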

CONCLUSIONS

Highly accurate long reads generated with HiFi or analogous technologies represent a key tool for maximizing genome assembly quality for a wide swath of plants and animals. This finding is particularly important when resources only allow for one type of sequencing data to be generated. Ultimately, to realize the promise of biodiversity genomics, we call for greater uptake of highly accurate long reads in future studies.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/fd5d/10018877/8a524a4fe3be/12864_2023_9193_Fig1_HTML.jpg