Monotony of surprise and large-scale quest for unusual words.

Authors

Apostolico Alberto, Bock Mary Ellen, Lonardi Stefano

Affiliation

Department of Computer Sciences, Purdue University, West Lafayette, IN 47907, USA.

Publication information

J Comput Biol. 2003;10(3-4):283-311. doi: 10.1089/10665270360688020.

DOI: 10.1089/10665270360688020
PMID: 12935329
Abstract

The problem of characterizing and detecting recurrent sequence patterns such as substrings or motifs and related associations or rules is variously pursued in order to compress data, unveil structure, infer succinct descriptions, extract and classify features, etc. In molecular biology, exceptionally frequent or rare words in bio-sequences have been implicated in various facets of biological function and structure. The discovery of such patterns, particularly on a massive scale, poses interesting methodological and algorithmic problems and often exposes scenarios in which tables and synopses grow faster and bigger than the raw sequences they are meant to encapsulate. In previous studies, the ability to succinctly compute, store, and display unusual substrings has been linked to a subtle interplay between the combinatorics of the subwords of a word and local monotonicities of some scores used to measure the departure from expectation. In this paper, we carry out an extensive analysis of such monotonicities for a broader variety of scores. This supports the construction of data structures and algorithms capable of performing global detection of unusual substrings in time and space linear in the subject sequences, under various probabilistic models.
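To make the notion of a score that measures "departure from expectation" concrete, the following is a minimal illustrative sketch, not the paper's method. It assumes an i.i.d. (Bernoulli) symbol model estimated from the sequence itself, enumerates words of one fixed length by brute force, and uses a simplified normalisation that ignores word self-overlap. The function names `symbol_probs` and `word_scores` are hypothetical. The paper's contribution, linear time and space over all substrings via monotonicity of such scores, is not reproduced here.

```python
from collections import Counter
from math import sqrt


def symbol_probs(seq):
    """Estimate per-symbol probabilities from the sequence (assumed i.i.d. model)."""
    counts = Counter(seq)
    n = len(seq)
    return {c: counts[c] / n for c in counts}


def word_scores(seq, k):
    """Return {word: (observed, expected, score)} for every k-mer occurring in seq.

    score = (observed - expected) / sqrt(expected * (1 - p)), where p is the
    probability of the word under the i.i.d. model. This variance ignores
    overlaps between occurrences, so it is only a rough approximation.
    """
    probs = symbol_probs(seq)
    positions = len(seq) - k + 1  # number of windows of length k
    observed = Counter(seq[i:i + k] for i in range(positions))
    scores = {}
    for word, f in observed.items():
        p = 1.0
        for c in word:
            p *= probs[c]
        expected = positions * p
        variance = expected * (1.0 - p)
        # Guard against a degenerate model (e.g. a single-symbol sequence).
        score = (f - expected) / sqrt(variance) if variance > 0 else 0.0
        scores[word] = (f, expected, score)
    return scores


if __name__ == "__main__":
    result = word_scores("ABAB", 2)
    for word in sorted(result):
        print(word, result[word])
```

Words with a large positive score occur more often than the model predicts, and words with a large negative score occur less often. Running this for every word length would produce a table that grows quadratically with the sequence, which is the blow-up the abstract says the monotonicity analysis is meant to avoid.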

Similar articles

1. Monotony of surprise and large-scale quest for unusual words.
J Comput Biol. 2003;10(3-4):283-311. doi: 10.1089/10665270360688020.
2. Fast model-based protein homology detection without alignment.
Bioinformatics. 2007 Jul 15;23(14):1728-36. doi: 10.1093/bioinformatics/btm247. Epub 2007 May 8.
3. Using Markov model to improve word normalization algorithm for biological sequence comparison.
Amino Acids. 2012 May;42(5):1867-77. doi: 10.1007/s00726-011-0906-2. Epub 2011 Apr 20.
4. Efficient detection of unusual words.
J Comput Biol. 2000 Feb-Apr;7(1-2):71-94. doi: 10.1089/10665270050081397.
5. Weighted relative entropy for alignment-free sequence comparison based on Markov model.
J Biomol Struct Dyn. 2011 Feb;28(4):545-55. doi: 10.1080/07391102.2011.10508594.
6. Identification of Words in Biological Sequences Under the Semi-Markov Hypothesis.
J Comput Biol. 2020 May;27(5):683-697. doi: 10.1089/cmb.2019.0253. Epub 2019 Sep 23.
7. Discriminative motifs.
J Comput Biol. 2003;10(3-4):599-615. doi: 10.1089/10665270360688219.
8. Discovering sequence motifs.
Methods Mol Biol. 2008;452:231-51. doi: 10.1007/978-1-60327-159-2_12.
9. Detecting correlations among functional-sequence motifs.
Phys Rev E Stat Nonlin Soft Matter Phys. 2012 Jun;85(6 Pt 2):066124. doi: 10.1103/PhysRevE.85.066124. Epub 2012 Jun 19.
10. Picking alignments from (Steiner) trees.
J Comput Biol. 2003;10(3-4):509-20. doi: 10.1089/10665270360688156.

Cited by

1. On avoided words, absent words, and their application to biological sequence analysis.
Algorithms Mol Biol. 2017 Mar 14;12:5. doi: 10.1186/s13015-017-0094-z. eCollection 2017.
2. Efficient algorithms for the discovery of gapped factors.
Algorithms Mol Biol. 2011 Mar 23;6:5. doi: 10.1186/1748-7188-6-5.
3. Peptide vocabulary analysis reveals ultra-conservation and homonymity in protein sequences.
Bioinform Biol Insights. 2009 Nov 24;1:101-26. doi: 10.4137/bbi.s415.
4. Metagenome fragment classification using N-mer frequency profiles.
Adv Bioinformatics. 2008;2008:205969. doi: 10.1155/2008/205969. Epub 2008 Nov 16.
5. Detecting seeded motifs in DNA sequences.
Nucleic Acids Res. 2005 Sep 1;33(15):e135. doi: 10.1093/nar/gni131.
6. A multistep bioinformatic approach detects putative regulatory elements in gene promoters.
BMC Bioinformatics. 2005 May 18;6:121. doi: 10.1186/1471-2105-6-121.