A genome alignment algorithm based on compression.

Affiliation

Clayton School of Information Technology, Monash University, Clayton 3800, Australia.

Publication information

BMC Bioinformatics. 2010 Dec 16;11:599. doi: 10.1186/1471-2105-11-599.

DOI:10.1186/1471-2105-11-599
PMID:21159205
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC3022628/
Abstract

BACKGROUND

Traditional genome alignment methods consider sequence alignment as a variation of the string edit distance problem, and perform alignment by matching characters of the two sequences. They are often computationally expensive and unable to deal with low information regions. Furthermore, they lack a well-principled objective function to measure the performance of sets of parameters. Since genomic sequences carry genetic information, this article proposes that the information content of each nucleotide in a position should be considered in sequence alignment. An information-theoretic approach for pairwise genome local alignment, namely XMAligner, is presented. Instead of comparing sequences at the character level, XMAligner considers a pair of nucleotides from two sequences to be related if their mutual information in context is significant. The information content of nucleotides in sequences is measured by a lossless compression technique.
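The idea of measuring per-nucleotide information content by compression can be sketched as follows. This is only a minimal illustration, not XMAligner's actual model: it assumes a simple order-k context model with add-one smoothing as a stand-in for the lossless compressor, and the function names (`train_counts`, `info_content`, `information_gain`) are hypothetical. The information content of a base is taken as -log2 of its predicted probability, and the gain for a base of one sequence is the number of bits saved when the model has already seen the other sequence.

```python
import math

ALPHABET = "ACGT"


def train_counts(seq, k):
    """Count how often each base follows each length-k context in seq."""
    counts = {}
    for i in range(k, len(seq)):
        followers = counts.setdefault(seq[i - k:i], {})
        followers[seq[i]] = followers.get(seq[i], 0) + 1
    return counts


def info_content(seq, k, counts=None):
    """Per-base information content in bits: -log2 P(base | previous k bases).

    With counts=None the model is adaptive and learns only from the bases of
    seq that precede the current position. Otherwise the given counts are used
    as a fixed model and are not updated.
    """
    adaptive = counts is None
    if adaptive:
        counts = {}
    bits = []
    for i, base in enumerate(seq):
        if i < k:
            bits.append(2.0)  # no full context yet: uniform over 4 bases
            continue
        ctx = seq[i - k:i]
        seen = counts.get(ctx, {})
        total = sum(seen.values())
        # Add-one smoothing so unseen bases never get probability zero.
        p = (seen.get(base, 0) + 1) / (total + len(ALPHABET))
        bits.append(-math.log2(p))
        if adaptive:
            counts.setdefault(ctx, {})[base] = seen.get(base, 0) + 1
    return bits


def information_gain(x, y, k):
    """Bits saved per base of y when predicted from x instead of from y alone."""
    alone = info_content(y, k)
    given_x = info_content(y, k, train_counts(x, k))
    return [a - b for a, b in zip(alone, given_x)]


if __name__ == "__main__":
    x = "ACGTTGCAAGCTTAGGCTAACGT" * 3
    y = x[5:40]  # y is a copy of a region of x
    gain = information_gain(x, y, k=3)
    print(f"total bits saved on y given x: {sum(gain):.2f}")
```

In this sketch, a run of positions with consistently positive gain would suggest that the region of `y` is related to `x`; how XMAligner decides that the mutual information is significant is not described in the abstract.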

RESULTS

Experiments on both simulated and real data show that XMAligner is superior to conventional methods, especially on distantly related sequences and statistically biased data. XMAligner can align sequences of eukaryote genome size with only modest hardware requirements. Importantly, the method has an objective function, which can obviate the need to choose parameter values to obtain high-quality alignments. The alignment results from XMAligner can be loaded into a visualisation tool for viewing.

CONCLUSIONS

The information-theoretic approach to sequence alignment is shown to overcome the problems of conventional character-matching alignment methods described above. The article shows that, because genomic sequences carry information, considering the information content of nucleotides is helpful for genomic sequence alignment.

AVAILABILITY

Downloadable binaries, documentation and data can be found at ftp://ftp.infotech.monash.edu.au/software/DNAcompress-XM/XMAligner/.
