Aires-de-Sousa João, Aires-de-Sousa Luisa
Departamento de Química, CQFB, campus Faculdade de Ciências e Tecnologia, Universidade Nova de Lisboa, Quinta da Torre, 2829-516 Monte de Caparica, Portugal.
Bioinformatics. 2003 Jan;19(1):30-6. doi: 10.1093/bioinformatics/19.1.30.
We propose representing individual positions in DNA sequences by virtual potentials generated by the other bases of the same sequence. This gives a compact representation of the neighbourhood of a base. The distribution of the virtual potentials over the whole sequence can then be used as a representation of the entire sequence (SEQREP code). It is a flexible code whose length is independent of the sequence size; it does not require prior alignment and is convenient for processing by neural networks or statistical techniques.
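The idea of turning per-position potentials into a fixed-length code can be illustrated with a minimal sketch. The abstract does not give the potential formula, so everything specific here is an assumption made for illustration only: the per-base "charges" in `CHARGES`, the inverse-distance form of the potential, the neighbourhood `window`, the number of histogram `bins`, and the function names `virtual_potentials` and `seqrep_like_code`. It is not the authors' SEQREP implementation.

```python
from typing import Dict, List

# Hypothetical per-base "charges" (assumption; not taken from the paper).
CHARGES: Dict[str, float] = {"A": 1.0, "C": 2.0, "G": 3.0, "T": 4.0}


def virtual_potentials(seq: str, window: int = 10) -> List[float]:
    """Potential at each position: sum of charge / distance over the
    other bases within `window` positions (assumed functional form)."""
    seq = seq.upper()
    n = len(seq)
    potentials = []
    for i in range(n):
        total = 0.0
        for j in range(max(0, i - window), min(n, i + window + 1)):
            if j != i:
                total += CHARGES[seq[j]] / abs(i - j)
        potentials.append(total)
    return potentials


def seqrep_like_code(seq: str, bins: int = 8, window: int = 10) -> List[float]:
    """Normalized histogram of the potentials. Its length is `bins`,
    whatever the sequence length, and no alignment is needed."""
    if not seq:
        raise ValueError("sequence must not be empty")
    potentials = virtual_potentials(seq, window)
    # Largest potential any position can reach with these charges and window.
    vmax = max(CHARGES.values()) * 2 * sum(1.0 / d for d in range(1, window + 1))
    counts = [0] * bins
    for p in potentials:
        k = min(int(p / vmax * bins), bins - 1)
        counts[k] += 1
    return [c / len(potentials) for c in counts]


if __name__ == "__main__":
    short_code = seqrep_like_code("ACGT" * 5)
    long_code = seqrep_like_code("ACGT" * 50)
    print(len(short_code), len(long_code))  # both equal `bins`
```

Because the code is a histogram over potentials rather than over positions, sequences of different lengths map to vectors of the same size, which is the property that makes such a code convenient as input to a neural network.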
To evaluate its biological significance, the SEQREP code was used to train Kohonen self-organizing maps (SOMs) in two applications: (a) detection of Alu sequences, and (b) classification of sequences encoding the HIV-1 envelope glycoprotein (env) into subtypes A-G. The SOMs clustered sequences belonging to different classes into distinct regions. For independent test sets, the rates of correct predictions were 97% in the first application and 91% in the second. Possible areas of application of SEQREP codes include functional genomics, phylogenetic analysis, detection of repetitions, database retrieval, and automatic alignment.
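For readers unfamiliar with Kohonen maps, the following is a minimal, generic SOM training sketch operating on fixed-length vectors such as sequence codes. It is a textbook-style illustration, not the software distributed by the authors: the grid size, the linear decay of the learning rate and neighbourhood radius, the Gaussian neighbourhood function, and the names `train_som` and `best_matching_unit` are all assumptions.

```python
import math
import random
from typing import List, Tuple

Vector = List[float]
Grid = List[List[Vector]]


def best_matching_unit(weights: Grid, x: Vector) -> Tuple[int, int]:
    """Return (row, col) of the node whose weight vector is closest to x."""
    best, best_d = (0, 0), float("inf")
    for r, row in enumerate(weights):
        for c, w in enumerate(row):
            d = sum((wi - xi) ** 2 for wi, xi in zip(w, x))
            if d < best_d:
                best, best_d = (r, c), d
    return best


def train_som(data: List[Vector], rows: int = 4, cols: int = 4,
              epochs: int = 50, lr0: float = 0.5, seed: int = 0) -> Grid:
    """Train a rows x cols map with a Gaussian neighbourhood whose radius
    and learning rate shrink linearly over the epochs (assumed schedule)."""
    rng = random.Random(seed)
    dim = len(data[0])
    weights = [[[rng.random() for _ in range(dim)] for _ in range(cols)]
               for _ in range(rows)]
    sigma0 = max(rows, cols) / 2.0
    for epoch in range(epochs):
        frac = epoch / epochs
        lr = lr0 * (1.0 - frac)
        sigma = max(sigma0 * (1.0 - frac), 0.5)
        order = data[:]
        rng.shuffle(order)
        for x in order:
            br, bc = best_matching_unit(weights, x)
            for r in range(rows):
                for c in range(cols):
                    grid_d2 = (r - br) ** 2 + (c - bc) ** 2
                    h = math.exp(-grid_d2 / (2.0 * sigma * sigma))
                    w = weights[r][c]
                    for k in range(dim):
                        # Pull the node toward the sample, scaled by
                        # learning rate and neighbourhood strength.
                        w[k] += lr * h * (x[k] - w[k])
    return weights


if __name__ == "__main__":
    # Two toy classes of 3-dimensional codes.
    class_a = [[0.0, 0.0, 0.0]] * 5
    class_b = [[1.0, 1.0, 1.0]] * 5
    som = train_som(class_a + class_b, rows=3, cols=3, epochs=30)
    print(best_matching_unit(som, class_a[0]), best_matching_unit(som, class_b[0]))
```

After training, each node can be labelled with the class of the training vectors that map to it, and an unseen vector is then assigned the label of its best matching unit. This is one common way to use a SOM as a classifier; the abstract does not state which labelling scheme was used.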
Software for representing sequences by SEQREP code and for training Kohonen SOMs is freely available from http://www.dq.fct.unl.pt/qoa/jas/seqrep.
Supplementary material is available at http://www.dq.fct.unl.pt/qoa/jas/seqrep/bioinf2002