Nakai K, Kanehisa M
Institute for Chemical Research, Kyoto University, Japan.
Genomics. 1992 Dec;14(4):897-911. doi: 10.1016/s0888-7543(05)80111-9.
To automate the examination of massive amounts of sequence data for biological function, it is important to computerize interpretation based on empirical knowledge of sequence-function relationships. For this purpose, we have been constructing a knowledge base by organizing various experimental and computational observations as a collection of if-then rules. Here we report an expert system that uses this knowledge base to predict the localization sites of proteins from the amino acid sequence and the source origin alone. We collected data for 401 eukaryotic proteins with known localization sites (subcellular and extracellular) and divided them into training data and testing data. Fourteen localization sites were distinguished for animal cells and 17 for plant cells. When sorting signals were not well characterized experimentally, various sequence features were computationally derived from the training data. Our expert system correctly predicted 66% of the training data and 59% of the testing data. This artificial intelligence approach is powerful and flexible enough to be used in genome analyses.
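To make the idea of a knowledge base of if-then rules concrete, here is a minimal Python sketch. It is not the authors' system: the two rules (the C-terminal KDEL/HDEL endoplasmic reticulum retention motif and the C-terminal SKL peroxisomal targeting motif) are well-known sorting signals chosen purely for illustration, and the names `RULES`, `DEFAULT_SITE`, `predict_site`, and `accuracy` are hypothetical. The abstract does not say which rules the knowledge base contains or how they are ordered.

```python
from typing import Callable, List, Sequence, Tuple

# A rule pairs a condition on the sequence with the site it implies.
Rule = Tuple[Callable[[str], bool], str]

# Illustrative rules only (assumption): two textbook C-terminal sorting signals.
RULES: List[Rule] = [
    (lambda seq: seq.endswith(("KDEL", "HDEL")), "endoplasmic reticulum"),
    (lambda seq: seq.endswith("SKL"), "peroxisome"),
]

# Fallback when no rule fires (assumption, not taken from the paper).
DEFAULT_SITE = "cytoplasm"


def predict_site(seq: str) -> str:
    """Return the site of the first rule whose condition matches the sequence."""
    seq = seq.upper()
    for condition, site in RULES:
        if condition(seq):
            return site
    return DEFAULT_SITE


def accuracy(examples: Sequence[Tuple[str, str]]) -> float:
    """Fraction of (sequence, known_site) pairs that are predicted correctly."""
    correct = sum(predict_site(seq) == site for seq, site in examples)
    return correct / len(examples)
```

Calling `accuracy` separately on a training set and on a held-out testing set gives the kind of paired figures the abstract reports. Rules are tried in order here, so a more specific rule should be listed before a more general one.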