Motif identification neural design for rapid and sensitive protein family search.

Authors

Wu C H, Zhao S, Chen H L, Lo C J, McLarty J

Affiliation

Department of Epidemiology/Biomathematics, University of Texas Health Center at Tyler 75710, USA.

Publication information

Comput Appl Biosci. 1996 Apr;12(2):109-18. doi: 10.1093/bioinformatics/12.2.109.

DOI:10.1093/bioinformatics/12.2.109

PMID:8744773

Abstract

A new method, the motif identification neural design (MOTIFIND), has been developed for rapid and sensitive protein family identification. The method extends our previous gene classification artificial neural system and employs new designs to enhance the detection of distant relationships. The new designs include an n-gram term weighting algorithm for extracting local motif patterns, an enhanced n-gram method for extracting residues of long-range correlation, and integrated neural networks for combining global and motif sequence information. The system has been tested and compared with several existing methods using three protein families: cytochrome c, cytochrome b, and flavodoxin. Overall, it achieves 100% sensitivity and >99.6% specificity, an accuracy comparable to that of BLAST, while running approximately 20 times faster. The system is much more robust than the PROSITE search, which is based on simple signature patterns. MOTIFIND also compares favorably with BLIMPS, the Hidden Markov Model, and PROFILESEARCH in detecting fragmentary sequences that lack complete motif regions and in detecting distant relationships, especially for members of under-represented subgroups within a family. MOTIFIND may be generally applicable to other proteins and has the potential to become a full-scale database search and sequence analysis tool.
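
As a rough illustration of the n-gram encoding idea mentioned in the abstract, the sketch below counts overlapping n-grams in a protein sequence and turns them into a fixed-length frequency vector that could serve as neural network input. It is not the MOTIFIND term weighting algorithm, whose details the abstract does not give; the function names, the 20-letter alphabet constant, and the choice of relative frequencies are all assumptions made for this example.

```python
from collections import Counter
from itertools import product

# Assumption: the standard 20-letter amino acid alphabet.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def ngram_counts(sequence, n=2):
    """Count overlapping n-grams (hypothetical helper, not from the paper)."""
    seq = sequence.upper()
    return Counter(seq[i:i + n] for i in range(len(seq) - n + 1))


def ngram_vector(sequence, n=2):
    """Return relative n-gram frequencies in a fixed vocabulary order.

    N-grams containing letters outside AMINO_ACIDS are ignored, so the
    vector always has 20**n entries regardless of sequence length.
    """
    counts = ngram_counts(sequence, n)
    vocabulary = ["".join(p) for p in product(AMINO_ACIDS, repeat=n)]
    total = sum(counts[gram] for gram in vocabulary)
    if total == 0:
        return [0.0] * len(vocabulary)
    return [counts[gram] / total for gram in vocabulary]
```

Because the vector length depends only on n, sequences of different lengths (including fragments) map to inputs of the same size, which is one plausible reason an n-gram encoding suits a fixed-input neural network.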
