Brameier Markus, Haan Josien, Krings Andrea, MacCallum Robert M
Stockholm Bioinformatics Center, Stockholm University, 106 91 Stockholm, Sweden.
BMC Bioinformatics. 2006 Jan 12;7:16. doi: 10.1186/1471-2105-7-16.
Methods for predicting protein function directly from amino acid sequences are useful tools in the study of uncharacterized protein families and in comparative genomics. Until now, this problem has been approached using machine learning techniques that attempt to predict whether or not a protein belongs to predefined functional categories or subcellular locations. A potential drawback of this approach is that the human-designated functional classes may not accurately reflect the underlying biology, and consequently important sequence-to-function relationships may be missed.
We show that a self-supervised data mining approach is able to find relationships between sequence features and functional annotations. No preconceived ideas about functional categories are required, and the training data is simply a set of protein sequences and their UniProt/Swiss-Prot annotations. The main technical aspect of the approach is the co-evolution of amino acid-based regular expressions and keyword-based logical expressions with genetic programming. Our experiments on a strictly non-redundant set of eukaryotic proteins reveal that the strongest and most easily detected sequence-to-function relationships are concerned with targeting to various cellular compartments, which is an area already well studied both experimentally and computationally. Of more interest are a number of broad functional roles which can also be correlated with sequence features. These include inhibition, biosynthesis, transcription and defence against bacteria. Despite substantial overlaps between these functions and their corresponding cellular compartments, we find clear differences in the sequence motifs used to predict some of these functions. For example, the presence of polyglutamine repeats appears to be linked more strongly to the "transcription" function than to the general "nuclear" function/location.
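The pairing of a sequence pattern with a keyword rule can be illustrated with a minimal sketch. It is not the authors' implementation: the toy records, the helper names (`keyword_rule`, `mcc`, `fitness`) and the choice of the Matthews correlation coefficient as an agreement score are all assumptions made for illustration, and the genetic programming search that would evolve both expressions is omitted. The sketch only shows how one candidate regular expression and one candidate keyword expression might be scored against each other on annotated sequences.

```python
import math
import re

# Toy records, invented for illustration only:
# (amino acid sequence, set of annotation keywords).
PROTEINS = [
    ("MSTQQQQQQQQAPLK", {"Transcription", "Nuclear protein"}),
    ("MKRQQQQQQQLLDE", {"Transcription"}),
    ("MALWTRLLPLLALL", {"Signal"}),
    ("MKKRPAAGSPEKRK", {"Nuclear protein"}),
    ("MGSSHLLVAAGTCD", {"Biosynthesis"}),
    ("MPQQQQQQHHTAVS", {"Transcription", "Nuclear protein"}),
]


def keyword_rule(keywords):
    """Hypothetical keyword-based logical expression:
    Transcription AND NOT Signal."""
    return "Transcription" in keywords and "Signal" not in keywords


def mcc(pairs):
    """Matthews correlation coefficient for (predicted, actual) booleans.
    Returns 0.0 when the coefficient is undefined."""
    tp = sum(1 for p, a in pairs if p and a)
    tn = sum(1 for p, a in pairs if not p and not a)
    fp = sum(1 for p, a in pairs if p and not a)
    fn = sum(1 for p, a in pairs if not p and a)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def fitness(pattern, rule, proteins):
    """Score how well 'sequence matches pattern' agrees with
    'annotation satisfies rule' across all proteins (assumed measure)."""
    regex = re.compile(pattern)
    pairs = [(regex.search(seq) is not None, rule(kws))
             for seq, kws in proteins]
    return mcc(pairs)


if __name__ == "__main__":
    # A polyglutamine-style pattern versus a dibasic pattern, both
    # scored against the same keyword rule.
    print(fitness(r"Q{6,}", keyword_rule, PROTEINS))
    print(fitness(r"K[KR]", keyword_rule, PROTEINS))
```

In a co-evolutionary setting, a score of this kind would presumably serve as the shared fitness of a (pattern, rule) pair, so that both expressions are rewarded when they select the same proteins.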
We have developed a novel and useful approach for knowledge discovery in annotated sequence data. The technique is able to identify functionally important sequence features and does not require expert knowledge. By viewing protein function from a sequence perspective, the approach is also suitable for discovering unexpected links between biological processes, such as the recently discovered role of ubiquitination in transcription.