Evolution of a computer program for classifying protein segments as transmembrane domains using genetic programming.

Author information

Koza J R

Affiliation

Computer Science Department, Stanford University, CA 94305-2140, USA.

Publication information

Proc Int Conf Intell Syst Mol Biol. 1994;2:244-52.

PMID:7584397

Abstract

The recently developed genetic programming paradigm is used to evolve a computer program that classifies a given protein segment as either a transmembrane domain or a non-transmembrane area of the protein. Genetic programming starts with a primordial ooze of randomly generated computer programs composed of the available programmatic ingredients, and then genetically breeds the population of programs using the Darwinian principle of survival of the fittest and an analog of the naturally occurring genetic operation of crossover (sexual recombination). Automatic function definition enables genetic programming to create subroutines dynamically during the run. Genetic programming is given a training set of differently sized protein segments and their correct classifications (but no biochemical knowledge, such as hydrophobicity values). Correlation is used as the fitness measure to drive the evolutionary process. The best genetically evolved program achieves an out-of-sample correlation of 0.968 and an out-of-sample error rate of 1.6%. This error rate is better than that of four other algorithms reported at the First International Conference on Intelligent Systems for Molecular Biology. Our genetically evolved program is an instance of an algorithm discovered by an automated learning paradigm that is superior to algorithms written by human investigators.
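
The abstract names correlation as the fitness measure and reports an error rate, but does not give the formulas. As a minimal sketch, assuming the correlation is the standard correlation coefficient for binary classification computed from true/false positive and negative counts (often called the Matthews correlation coefficient), the two figures could be computed as follows. The function names `correlation` and `error_rate` and the sample data are hypothetical and do not come from the paper.

```python
import math


def correlation(predicted, actual):
    """Correlation between binary predictions and binary labels.

    Assumption: the standard confusion-matrix form of the correlation
    coefficient is used; the abstract does not state the exact formula.
    """
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p and a)          # true positives
    tn = sum(1 for p, a in pairs if not p and not a)  # true negatives
    fp = sum(1 for p, a in pairs if p and not a)      # false positives
    fn = sum(1 for p, a in pairs if not p and a)      # false negatives
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Degenerate case (a whole row or column of the confusion
        # matrix is empty): treat as no correlation, by convention.
        return 0.0
    return (tp * tn - fp * fn) / denom


def error_rate(predicted, actual):
    """Fraction of segments whose predicted class differs from the label."""
    wrong = sum(1 for p, a in zip(predicted, actual) if bool(p) != bool(a))
    return wrong / len(actual)


# Hypothetical labels: True = transmembrane domain, False = non-transmembrane.
actual = [True, False, False, False]
predicted = [True, False, True, False]
c = correlation(predicted, actual)
e = error_rate(predicted, actual)
```

A correlation of 1.0 means every segment is classified correctly, 0.0 is no better than chance, and -1.0 means every segment is misclassified. Unlike the raw error rate, this measure stays informative when one class is much more common than the other, which is one plausible reason to use it as a fitness measure.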
