

Hercules: a profile HMM-based hybrid error correction algorithm for long reads.

Affiliations

Department of Computer Engineering, Bilkent University, Ankara 06800, Turkey.

Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA.

Publication information

Nucleic Acids Res. 2018 Nov 30;46(21):e125. doi: 10.1093/nar/gky724.

DOI:10.1093/nar/gky724
PMID:30124947
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC6265270/
Abstract

Choosing whether to use second or third generation sequencing platforms can lead to trade-offs between accuracy and read length. Several types of studies require long and accurate reads. In such cases researchers often combine both technologies and the erroneous long reads are corrected using the short reads. Current approaches rely on various graph or alignment based techniques and do not take the error profile of the underlying technology into account. Efficient machine learning algorithms that address these shortcomings have the potential to achieve more accurate integration of these two technologies. We propose Hercules, the first machine learning-based long read error correction algorithm. Hercules models every long read as a profile Hidden Markov Model with respect to the underlying platform's error profile. The algorithm learns a posterior transition/emission probability distribution for each long read to correct errors in these reads. We show on two DNA-seq BAC clones (CH17-157L1 and CH17-227A2) that Hercules-corrected reads have the highest mapping rate among all competing algorithms and have the highest accuracy when the breadth of coverage is high. On a large human CHM1 cell line WGS data set, Hercules is one of the few scalable algorithms; and among those, it achieves the highest accuracy.
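The abstract's central idea, updating per-position emission probabilities of a long read with short-read evidence, can be illustrated with a deliberately simplified sketch. This is not the Hercules implementation: the published method uses a full profile HMM with transition probabilities and models the platform's error profile, whereas the code below handles substitutions only, assumes ungapped short-read alignments given as `(start, sequence)` pairs, and uses made-up error rates (`sub_rate`, `short_err`). All function names are hypothetical.

```python
from math import exp, log

BASES = "ACGT"


def prior_emissions(long_read, sub_rate=0.1):
    """Prior per position: the observed long-read base gets 1 - sub_rate,
    and the other three bases share sub_rate equally (assumed error model)."""
    return [
        {b: (1.0 - sub_rate) if b == base else sub_rate / 3.0 for b in BASES}
        for base in long_read
    ]


def update_with_short_reads(emissions, alignments, short_err=0.01):
    """Multiply short-read evidence into each position (in log space),
    then renormalise so every position is a probability distribution."""
    log_post = [{b: log(p) for b, p in pos.items()} for pos in emissions]
    for start, read in alignments:
        for offset, observed in enumerate(read):
            pos = start + offset
            if not 0 <= pos < len(log_post):
                continue  # ignore bases that fall outside the long read
            for b in BASES:
                p = (1.0 - short_err) if b == observed else short_err / 3.0
                log_post[pos][b] += log(p)
    posterior = []
    for pos in log_post:
        peak = max(pos.values())
        weights = {b: exp(v - peak) for b, v in pos.items()}
        total = sum(weights.values())
        posterior.append({b: w / total for b, w in weights.items()})
    return posterior


def decode(posterior):
    """Corrected read: the most probable base at each position."""
    return "".join(max(pos, key=pos.get) for pos in posterior)


if __name__ == "__main__":
    noisy_long_read = "ACGAACGT"  # position 3 carries a substitution error
    short_reads = [(0, "ACGTAC"), (2, "GTACGT")]
    corrected = decode(
        update_with_short_reads(prior_emissions(noisy_long_read), short_reads)
    )
    print(corrected)
```

Positions with no short-read coverage keep their prior, so the long-read base is left unchanged there. Handling insertions and deletions, as the abstract describes, would additionally require insertion and deletion states and learned transition probabilities.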


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/48d0/6265270/f000362efe13/gky724fig1.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/48d0/6265270/ab2ec6bd702f/gky724fig2.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/48d0/6265270/60ac6f4e80f6/gky724fig3.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/48d0/6265270/e7540f80b761/gky724fig4.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/48d0/6265270/9fbeadc5bb2c/gky724fig5.jpg


