Improving the sensitivity of long read overlap detection using grouped short k-mer matches.

Affiliations

Department of Computer Science and Engineering, Michigan State University, East Lansing, 48824, MI, USA.

Electronic Engineering Department, City University of Hong Kong, Hong Kong SAR, China.

Publication information

BMC Genomics. 2019 Apr 4;20(Suppl 2):190. doi: 10.1186/s12864-019-5475-x.

DOI:10.1186/s12864-019-5475-x
PMID:30967123
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC6456931/
Abstract

BACKGROUND

Single-molecule, real-time (SMRT) sequencing, developed by Pacific Biosciences (PacBio), produces longer reads than second-generation sequencing technologies such as Illumina. The increased read length enables PacBio sequencing to close gaps in genome assemblies, reveal structural variations, and characterize intra-species variation. It also holds promise for deciphering the structure of complex microbial communities, because long reads help metagenomic assembly. One key step in genome assembly with long reads is to quickly identify pairs of reads that overlap. Because PacBio data have a higher sequencing error rate and lower coverage than popular short-read sequencing technologies (such as Illumina), efficient detection of true overlaps requires specially designed algorithms. In particular, there is still a need to improve the sensitivity of detecting small overlaps and overlaps in which both reads have high error rates. Addressing this need will enable better assembly of metagenomic data produced by third-generation sequencing technologies.

RESULTS

In this work, we designed and implemented GroupK, an overlap detection program for third-generation sequencing reads that is based on grouped k-mer hits. While several existing programs use k-mer hits to detect read overlaps, our method uses a group of short k-mer hits that satisfy statistically derived distance constraints to increase the sensitivity of small overlap detection. Grouped k-mer hits were originally designed for homology search, and we are the first to apply them to long read overlap detection. Experiments applying our pipeline to both simulated and real third-generation sequencing data showed that GroupK enables more sensitive overlap detection, especially for datasets with low sequencing coverage.
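The idea of grouping short k-mer hits under distance constraints can be illustrated with a minimal sketch. This is not GroupK's implementation: the function names `kmer_hits` and `group_hits`, the greedy grouping strategy, and the fixed thresholds `max_gap`, `max_diag_shift`, and `min_hits` are all assumptions made for illustration, whereas the paper derives its distance constraints statistically.

```python
from collections import defaultdict


def kmer_hits(a, b, k):
    """Return sorted (pos_in_a, pos_in_b) pairs where reads a and b share an exact k-mer."""
    index = defaultdict(list)
    for i in range(len(a) - k + 1):
        index[a[i:i + k]].append(i)
    hits = []
    for j in range(len(b) - k + 1):
        for i in index.get(b[j:j + k], ()):
            hits.append((i, j))
    hits.sort()
    return hits


def group_hits(hits, max_gap, max_diag_shift, min_hits):
    """Greedily group sorted hits whose spacing and diagonal drift stay within bounds.

    max_gap, max_diag_shift and min_hits are hypothetical fixed thresholds;
    they stand in for the statistically derived constraints of the paper.
    """
    groups = []
    used = [False] * len(hits)
    for s in range(len(hits)):
        if used[s]:
            continue
        group = [hits[s]]
        used[s] = True
        for t in range(s + 1, len(hits)):
            if used[t]:
                continue
            (pi, pj), (i, j) = group[-1], hits[t]
            if i - pi > max_gap:
                break  # hits are sorted by i, so all later hits are farther away
            same_direction = i > pi and j > pj
            close_in_b = j - pj <= max_gap
            small_drift = abs((i - j) - (pi - pj)) <= max_diag_shift
            if same_direction and close_in_b and small_drift:
                group.append(hits[t])
                used[t] = True
        if len(group) >= min_hits:
            groups.append(group)
    return groups


# Hand-made hits: three lie near one diagonal, one is an isolated random match.
hits = [(0, 5), (4, 9), (9, 15), (40, 2)]
groups = group_hits(hits, max_gap=10, max_diag_shift=2, min_hits=3)
print(groups)  # [[(0, 5), (4, 9), (9, 15)]]
```

In this sketch a single short k-mer hit is never reported on its own. Only a run of nearby hits that stay close to one diagonal counts as evidence of an overlap, which is how short k-mers can remain usable on noisy reads without every chance match producing a candidate.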

CONCLUSIONS

GroupK is best used for detecting small overlaps in third-generation sequencing data. It is a useful supplement to existing tools for more sensitive and accurate overlap detection. The source code is freely available at https://github.com/Strideradu/GroupK.
