Gatherer Derek
MRC Virology Unit, Institute of Virology, Church Street, Glasgow G11 5JR UK.
Bioinform Biol Insights. 2009 Nov 24;1:101-26. doi: 10.4137/bbi.s415.
A new algorithm is presented for vocabulary analysis (word detection) in texts of human origin. It achieves 60%-70% overall accuracy, greater than 80% accuracy for longer words, and approximately 85% sensitivity on Alice in Wonderland, a considerable improvement on previous methods. When applied to protein sequences, it detects short sequences that are analogous to words in human texts, i.e. intolerant of changes in spelling (mutation) and relatively context-independent in their meaning (function). Some of these are homonyms of up to 7 amino acids, which can assume different structures in different proteins. Others are ultra-conserved stretches of up to 18 amino acids within proteins of less than 40% overall identity, reflecting extreme constraint or convergent evolution. Different species are found to have qualitatively different major peptide vocabularies: some are dominated by large gene families, while others are rich in simple repeats or dominated by internally repetitive proteins. This suggests the possibility of a peptide vocabulary signature, analogous to genome signatures in DNA. Homonyms may be useful in detecting convergent evolution and positive selection in protein evolution. Ultra-conserved words may be useful in identifying structures intolerant of substitution over long periods of evolutionary time.
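The accuracy and sensitivity figures quoted above can be made concrete with a small scoring sketch. The abstract does not define either measure, so this assumes the conventional reading: accuracy is the fraction of predicted words that appear in a reference vocabulary, and sensitivity is the fraction of reference words that were predicted. The function name `score_vocabulary`, the `min_len` parameter, and the toy word lists are hypothetical and are not taken from the paper.

```python
def score_vocabulary(predicted, reference, min_len=1):
    """Score a predicted vocabulary against a reference word list.

    Assumed definitions (not stated in the abstract):
      accuracy    = correct predictions / all predictions
      sensitivity = correct predictions / all reference words
    """
    # Restrict both sets to words of at least min_len characters,
    # which allows a separate score for "longer words".
    pred = {w for w in predicted if len(w) >= min_len}
    ref = {w for w in reference if len(w) >= min_len}
    hits = pred & ref
    accuracy = len(hits) / len(pred) if pred else 0.0
    sensitivity = len(hits) / len(ref) if ref else 0.0
    return accuracy, sensitivity


# Toy data for illustration only: "thera" and "ing" stand in for
# false-positive fragments, "hatter" for a missed word.
predicted = {"alice", "rabbit", "queen", "thera", "ing"}
reference = {"alice", "rabbit", "queen", "hatter"}

overall = score_vocabulary(predicted, reference)
longer = score_vocabulary(predicted, reference, min_len=6)
```

Scoring longer words separately, as in the `min_len=6` call, mirrors the way the abstract reports one figure overall and another for longer words.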