Localized suffix array and its application to genome mapping problems for paired-end short reads.

Author information

Kimura Kouichi, Koike Asako

Affiliation

Central Research Laboratory, Hitachi Ltd, 1-280 Higashi-Koigakubo, Kokubunji, Tokyo 185-8601, Japan.

Publication information

Genome Inform. 2009 Oct;23(1):60-71.

PMID: 20180262

Abstract

We introduce a new data structure, the localized suffix array, in which occurrence information for text search is represented dynamically as a combination of global positional information and local lexicographic order information. When searching for a pair of words within a given distance, many candidate positions that share a coarse-grained global position can be represented compactly in terms of local lexicographic order, as in the conventional suffix array, and can be examined simultaneously for violation of the distance constraint at that coarse-grained resolution. The trade-off between positional and lexicographic information is then shifted progressively towards finer positional resolution, and the distance constraint is re-examined accordingly. The paired search can therefore be performed efficiently even when each word has a large number of occurrences. The localized suffix array is in fact a reordering of the bits inside the conventional suffix array, so their memory requirements are essentially the same. We demonstrate an application to genome mapping problems for paired-end short reads generated by new-generation DNA sequencers. When paired reads are highly repetitive, it is time-consuming to naïvely calculate, sort, and compare all of the coordinates. For human genome re-sequencing data with 36-base-pair reads, speedups of more than 10 times over the naïve method were observed in almost half of the cases where the sum of the redundancies (numbers of individual occurrences) of the paired reads was greater than 2,000.
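
The contrast between the naïve paired search and a coarse-grained check can be illustrated with a small Python sketch. This is not the authors' localized suffix array: it uses a conventional suffix array (built naïvely) plus a single fixed block resolution, and every name in it (`build_suffix_array`, `occurrences`, `paired_hits_naive`, `paired_hits_coarse`, the `shift` parameter) is hypothetical. It only shows the general idea that candidates sharing a coarse global position can be accepted or discarded together before exact coordinates are compared.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict


def build_suffix_array(text):
    # Naive construction by sorting suffixes; suitable for toy inputs only.
    return sorted(range(len(text)), key=lambda i: text[i:])


def occurrences(text, sa, word):
    # Binary search for the suffix-array interval whose suffixes start with `word`.
    m = len(word)
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < word:
            lo = mid + 1
        else:
            hi = mid
    start, hi = lo, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= word:
            lo = mid + 1
        else:
            hi = mid
    return sa[start:lo]  # positions in lexicographic order, not positional order


def paired_hits_naive(pos_a, pos_b, max_dist):
    # Baseline: sort all coordinates, then compare them at full resolution.
    sorted_b = sorted(pos_b)
    hits = []
    for a in sorted(pos_a):
        lo = bisect_left(sorted_b, a)
        hi = bisect_right(sorted_b, a + max_dist)
        hits.extend((a, b) for b in sorted_b[lo:hi])
    return hits


def paired_hits_coarse(pos_a, pos_b, max_dist, shift):
    # Assumption: blocks of size 2**shift stand in for the "coarse-grained
    # global position"; the paper refines resolution progressively instead.
    blocks_a, blocks_b = defaultdict(list), defaultdict(list)
    for a in pos_a:
        blocks_a[a >> shift].append(a)
    for b in pos_b:
        blocks_b[b >> shift].append(b)
    reach = (max_dist >> shift) + 1  # furthest block that can still satisfy the constraint
    hits = []
    for blk, a_list in blocks_a.items():
        for nb in range(blk, blk + reach + 1):
            # A whole block of candidates is skipped when no partner block exists.
            for b in blocks_b.get(nb, ()):
                hits.extend((a, b) for a in a_list if 0 <= b - a <= max_dist)
    return sorted(hits)


text = "ACGTACGTTTACGAACGT"
sa = build_suffix_array(text)
first = occurrences(text, sa, "ACG")
second = occurrences(text, sa, "CGT")
print(paired_hits_naive(first, second, max_dist=4))
print(paired_hits_coarse(first, second, max_dist=4, shift=2))
```

Both functions return the same pairs; the coarse version simply avoids comparing coordinates whose blocks are too far apart, which matters most when each word occurs many times.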
