
Peptide vocabulary analysis reveals ultra-conservation and homonymity in protein sequences.

Author information

Gatherer Derek

Affiliation

MRC Virology Unit, Institute of Virology, Church Street, Glasgow G11 5JR, UK.

Publication information

Bioinform Biol Insights. 2009 Nov 24;1:101-26. doi: 10.4137/bbi.s415.

DOI:10.4137/bbi.s415
PMID:20066129
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC2789693/

Abstract

A new algorithm is presented for vocabulary analysis (word detection) in texts of human origin. It performs at 60%-70% overall accuracy and greater than 80% accuracy for longer words, and approximately 85% sensitivity on Alice in Wonderland, a considerable improvement on previous methods. When applied to protein sequences, it detects short sequences analogous to words in human texts, i.e. intolerant to changes in spelling (mutation), and relatively context-independent in their meaning (function). Some of these are homonyms of up to 7 amino acids, which can assume different structures in different proteins. Others are ultra-conserved stretches of up to 18 amino acids within proteins of less than 40% overall identity, reflecting extreme constraint or convergent evolution. Different species are found to have qualitatively different major peptide vocabularies, e.g. some are dominated by large gene families, while others are rich in simple repeats or dominated by internally repetitive proteins. This suggests the possibility of a peptide vocabulary signature, analogous to genome signatures in DNA. Homonyms may be useful in detecting convergent evolution and positive selection in protein evolution. Ultra-conserved words may be useful in identifying structures intolerant to substitution over long periods of evolutionary time.
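
To make the idea of peptide "words" shared between proteins concrete, here is a minimal Python sketch. It is not the algorithm presented in the paper, whose details the abstract does not give. It only enumerates fixed-length peptides (k-mers) and reports those that occur in at least two different proteins. The function name `shared_peptide_words`, the parameter `k`, and the toy sequences are all hypothetical and chosen for illustration.

```python
from collections import defaultdict


def shared_peptide_words(proteins, k):
    """Map each length-k peptide to the set of protein names containing it,
    keeping only peptides found in at least two different proteins.

    Illustrative only: a plain k-mer scan, not the paper's word-detection method.
    """
    occurrences = defaultdict(set)
    for name, seq in proteins.items():
        # Slide a window of width k along the sequence.
        for i in range(len(seq) - k + 1):
            occurrences[seq[i:i + k]].add(name)
    return {word: names for word, names in occurrences.items() if len(names) >= 2}


# Toy sequences invented for this example; they are not real proteins.
toy_proteins = {
    "protA": "MKTAYIAKQR",
    "protB": "GGSAYIAKLL",
    "protC": "MPPLLVVNNE",
}

shared = shared_peptide_words(toy_proteins, k=5)
for word, names in sorted(shared.items()):
    print(word, sorted(names))
```

In this toy input the only 5-residue peptide common to two proteins is `AYIAK`, shared by `protA` and `protB`. A real analysis along the lines of the abstract would additionally need to compare overall sequence identity (to flag ultra-conserved words in otherwise divergent proteins) and structural context (to flag homonyms), neither of which this sketch attempts.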


