Feature selection for genetic sequence classification.

Authors

Chuzhanova N A, Jones A J, Margetts S

Affiliation

Institute of Mathematics, Siberian Branch of Russian Academy of Science, Novosibirsk, Russia.

Publication information

Bioinformatics. 1998;14(2):139-43. doi: 10.1093/bioinformatics/14.2.139.

DOI:10.1093/bioinformatics/14.2.139

PMID:9545445

Abstract

MOTIVATION

Most existing methods for genetic sequence classification are based on a computer search for homologies in nucleotide or amino acid sequences. Standard sequence alignment programs scale very poorly as the number of sequences increases or when the degree of sequence identity falls below 30%. Some new, computationally inexpensive methods based on nucleotide or amino acid compositional analysis have been proposed, but their prediction results are still unsatisfactory and depend on the features chosen to represent the sequences.

RESULTS

In this paper, a feature selection method based on the Gamma (or near-neighbour) test is proposed. If there is a continuous or smooth map from feature space to the classification target values, the Gamma test gives an estimate of the mean-squared error of the classification, even though one has no a priori knowledge of the smooth mapping. Using a genetic algorithm, we can search a large space of possible feature combinations for the combination that gives the smallest estimated mean-squared error. The method was used for feature selection and classification of the large subunits of rRNA according to RDP (Ribosomal Database Project) phylogenetic classes. The sequences were represented by their dinucleotide frequency distributions. The nearest-neighbour criterion was used to estimate the predictive accuracy of the classification based on the selected features. For the examples discussed, we found that classification according to the first nearest neighbour is correct for 80% of the test samples. If we consider the set of the 10 nearest neighbours, then 94% of the test samples are classified correctly.
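As an illustration only, the following Python sketch shows one way the two main ingredients described above could be computed: dinucleotide frequency features, and a Gamma test estimate of the noise variance for a candidate feature subset. It is not the authors' released program. The function names (`dinucleotide_frequencies`, `gamma_test`, `masked`), the default of `p=10` near neighbours, the choice of overlapping dinucleotides, and the numeric encoding of class targets are all assumptions made for this example. The genetic algorithm search is omitted; it would score each candidate feature mask by the value `gamma_test` returns.

```python
from itertools import product
from math import fsum

# The 16 dinucleotides over the RNA alphabet, in a fixed (assumed) order.
DINUCLEOTIDES = ["".join(pair) for pair in product("ACGU", repeat=2)]


def dinucleotide_frequencies(seq):
    """Return the 16 overlapping-dinucleotide frequencies of a sequence."""
    seq = seq.upper().replace("T", "U")
    counts = dict.fromkeys(DINUCLEOTIDES, 0)
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        if pair in counts:  # skip pairs containing ambiguous symbols
            counts[pair] += 1
    total = sum(counts.values())
    return [counts[d] / total if total else 0.0 for d in DINUCLEOTIDES]


def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return fsum((x - y) ** 2 for x, y in zip(a, b))


def gamma_test(X, y, p=10):
    """Estimate the noise variance of y = f(X) + r with the Gamma test.

    For k = 1..p, delta(k) is the mean squared distance to the k-th near
    neighbour in feature space and gamma(k) is half the mean squared
    difference of the corresponding targets. The intercept of the least
    squares line gamma = A * delta + Gamma is returned.
    """
    m = len(X)
    if m < 3:
        raise ValueError("need at least three samples")
    p = min(p, m - 1)
    deltas = [0.0] * p
    gammas = [0.0] * p
    for i in range(m):
        # Brute-force neighbour search; adequate for a small sketch.
        neighbours = sorted(
            (sq_dist(X[i], X[j]), j) for j in range(m) if j != i
        )
        for k in range(p):
            d, j = neighbours[k]
            deltas[k] += d / m
            gammas[k] += (y[j] - y[i]) ** 2 / (2 * m)
    mean_d = fsum(deltas) / p
    mean_g = fsum(gammas) / p
    var_d = fsum((d - mean_d) ** 2 for d in deltas)
    if var_d == 0.0:
        return mean_g  # degenerate case: no slope can be fitted
    slope = fsum(
        (d - mean_d) * (g - mean_g) for d, g in zip(deltas, gammas)
    ) / var_d
    return mean_g - slope * mean_d


def masked(X, mask):
    """Keep only the feature columns whose mask entry is truthy."""
    return [[v for v, keep in zip(row, mask) if keep] for row in X]
```

Under these assumptions, a feature subset would be scored as `gamma_test(masked(X, mask), y)`, where `X` holds one dinucleotide frequency vector per sequence and `y` holds a numeric target for each sequence; smaller values suggest that the selected features leave less unexplained variance in the target.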

AVAILABILITY

The principal novel component of this method is the Gamma test and this can be downloaded compiled for Unix Sun 4, Windows 95 and MS-DOS from http://www.cs.cf.ac.uk/ec/

CONTACT

s.margetts@cs.cf.ac.uk
