Wu C, Shivakumar S
Department of Epidemiology/Biomathematics, University of Texas Health Center at Tyler 75710.
Nucleic Acids Res. 1994 Oct 11;22(20):4291-9. doi: 10.1093/nar/22.20.4291.
A neural network system has been developed for rapid and accurate classification of ribosomal RNA sequences according to phylogenetic relationship. The molecular sequences are encoded into neural input vectors using an n-gram hashing method. An SVD (singular value decomposition) method is used to compress the long, sparse n-gram input vectors. The neural networks are three-layered, feed-forward networks that employ supervised learning paradigms, including the back-propagation algorithm and a modified counter-propagation algorithm. A pedagogical pattern selection strategy is used to reduce the training time. After being trained with ribosomal RNA sequences from the RDP (Ribosomal Database Project) database, the system can classify query sequences into more than one hundred phylogenetic classes with 100% accuracy, using less than 0.3 CPU seconds per sequence on a workstation. Compared with other sequence similarity search methods, including Similarity Rank, Blast and Fasta, the neural network method has a higher classification accuracy and is about an order of magnitude faster. The software tool will be made available to the biology community, and the system may be extended into a gene identification system for classifying indiscriminately sequenced DNA fragments.
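The encoding and compression steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the 4-letter RNA alphabet, the choice of n = 3, the length normalization, and the function names `ngram_index`, `encode`, `fit_svd` and `compress` are all assumptions made for illustration. Each possible n-gram is mapped to a fixed vector position and counted, and the resulting sparse vectors are projected onto the top k right singular vectors of the training matrix.

```python
import itertools
import numpy as np

ALPHABET = "ACGU"  # assumed RNA alphabet; the abstract does not specify one


def ngram_index(n):
    """Map every possible n-gram over ALPHABET to a fixed vector position."""
    grams = itertools.product(ALPHABET, repeat=n)
    return {"".join(g): i for i, g in enumerate(grams)}


def encode(seq, n=3):
    """Count the n-grams in seq and return a length-normalized count vector."""
    index = ngram_index(n)
    vec = np.zeros(len(index))
    seq = seq.upper().replace("T", "U")  # accept DNA-style input as well
    for i in range(len(seq) - n + 1):
        pos = index.get(seq[i:i + n])
        if pos is not None:  # skip n-grams with ambiguity codes such as N
            vec[pos] += 1
    total = vec.sum()
    return vec / total if total else vec


def fit_svd(X, k):
    """Return the top-k right singular vectors of X (rows are sequences)."""
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    return vt[:k].T  # shape: (n_features, k)


def compress(X, basis):
    """Project n-gram vectors onto the reduced basis."""
    return X @ basis


# Hypothetical toy sequences, not taken from the RDP database.
train = ["ACGUACGUAGCUAGCU", "GGGCUAGCUAGGAUCC", "UUAACCGGUUAACCGG"]
X = np.array([encode(s) for s in train])  # 3 sequences x 64 trigram counts
basis = fit_svd(X, k=2)
Z = compress(X, basis)                    # 3 sequences x 2 reduced features
```

In this sketch the reduced vectors in `Z` would serve as the inputs to the feed-forward network. A query sequence is encoded with the same `encode` call and projected with the same `basis`, so training and query vectors share one feature space.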