HiC-Hiker: a probabilistic model to determine contig orientation in chromosome-length scaffolds with Hi-C.

Affiliation

Department of Computational Biology and Medical Sciences, Graduate School of Frontier Sciences, The University of Tokyo, Chiba 277-8562, Japan.

Publication information

Bioinformatics. 2020 Jul 1;36(13):3966-3974. doi: 10.1093/bioinformatics/btaa288.

Abstract

MOTIVATION

De novo assembly of reference-quality genomes used to require enormously laborious tasks. In particular, building the genome markers needed to order assembled contigs along chromosomes is extremely time-consuming; thus, such markers are available only for well-established model organisms. To resolve this issue, recent studies demonstrated that Hi-C can be a powerful and cost-effective means of producing chromosome-length scaffolds for non-model species with no genome marker resources, because the Hi-C contact frequency between a pair of loci can be a good estimator of their genomic distance, even if there is a large gap between them. Indeed, state-of-the-art methods such as 3D-DNA are now widely used for locating contigs in chromosomes. However, it remains challenging to reduce errors in contig orientation because shorter contigs have fewer contacts with their neighboring contigs. These orientation errors lower the accuracy of gene prediction, read alignment, and synteny block estimation in comparative genomics.

RESULTS

To reduce these contig orientation errors, we propose a new algorithm, named HiC-Hiker, which has a firm grounding in probabilistic theory, rigorously models Hi-C contacts across contigs, and effectively infers the most probable orientations via the Viterbi algorithm. We compared HiC-Hiker and 3D-DNA using human and worm genome contigs generated from short reads, evaluated their performances, and observed a remarkable reduction in the contig orientation error rate from 4.3% (3D-DNA) to 1.7% (HiC-Hiker). Our algorithm can consider long-range information between distal contigs and precisely estimates Hi-C read contact probabilities among contigs, which may also be useful for determining the ordering of contigs.
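To illustrate how the Viterbi algorithm can pick the most probable set of orientations, here is a minimal sketch. It is not the HiC-Hiker implementation. It assumes a simplified model in which only adjacent contigs are coupled, whereas the abstract states that the actual method also uses information between distal contigs. The function name `viterbi_orientations` and the input table `pair_loglik` are hypothetical. `pair_loglik[i][a][b]` stands for an assumed log-likelihood of the Hi-C contacts between contig i and contig i+1 when they have orientations a and b (0 = forward, 1 = reversed).

```python
from typing import List, Sequence


def viterbi_orientations(
    pair_loglik: Sequence[Sequence[Sequence[float]]],
) -> List[int]:
    """Return the orientation (0 = forward, 1 = reversed) of each contig
    that maximizes the summed log-likelihood over adjacent contig pairs.

    pair_loglik[i][a][b] is a hypothetical score for contigs i and i+1
    having orientations a and b; it is an assumption of this sketch.
    """
    n_contigs = len(pair_loglik) + 1
    # best[s]: best total score of any orientation assignment for the
    # contigs seen so far that ends with the current contig in state s.
    best = [0.0, 0.0]
    backpointers = []

    for i in range(n_contigs - 1):
        new_best = []
        pointers = []
        for b in (0, 1):
            scores = [best[a] + pair_loglik[i][a][b] for a in (0, 1)]
            a_star = 0 if scores[0] >= scores[1] else 1
            new_best.append(scores[a_star])
            pointers.append(a_star)
        best = new_best
        backpointers.append(pointers)

    # Trace back from the best final state to recover every orientation.
    state = 0 if best[0] >= best[1] else 1
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


# Toy scores for three contigs (two adjacent pairs); the values are made up.
toy_scores = [
    [[-1.0, -5.0], [-4.0, -2.0]],
    [[-3.0, -1.0], [-6.0, -2.0]],
]
orientations = viterbi_orientations(toy_scores)
```

With two states per contig, the dynamic program runs in time linear in the number of contigs, rather than enumerating all 2^n orientation assignments. Extending the sketch to couple each contig with several neighbors would enlarge the state space accordingly.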

AVAILABILITY AND IMPLEMENTATION

HiC-Hiker is freely available at: https://github.com/ryought/hic_hiker.
