Friedel Caroline C, Jahn Katharina H V, Sommer Selina, Rudd Stephen, Mewes Hans W, Tetko Igor V
Institut fuer Informatik, Ludwig-Maximilians-Universitaet Muenchen, Oettingenstrasse 67, 80538 Muenchen, Germany.
Bioinformatics. 2005 Apr 15;21(8):1383-8. doi: 10.1093/bioinformatics/bti200. Epub 2004 Dec 7.
Discovery of host and pathogen genes expressed at the plant-pathogen interface often requires the construction of mixed libraries that contain sequences from both genomes. Identifying these sequences requires high-throughput, reliable classification of genome origin. With single-pass cDNA sequences, difficulties arise from the short sequence length, the lack of sufficient taxonomically relevant sequence data in public databases, and ambiguous sequence homology between plant and pathogen genes.
We describe a novel method that does not depend on the availability of homologous genes and instead relies on subtle differences in codon usage between plant and fungal genes. We used support vector machines (SVMs) to identify the probable origin of sequences. SVMs were compared to several other machine learning techniques and to PF-IND, a probabilistic algorithm for expressed sequence tag (EST) classification that is also based on codon bias differences. Our software (Eclat) achieved a classification accuracy of 93.1% on a test set of 3217 EST sequences from Hordeum vulgare and Blumeria graminis, a significant improvement over PF-IND (prediction accuracy of 81.2% on the same test set). EST sequences with at least 50 nt of coding sequence can be classified using Eclat with high confidence. Eclat allows training of classifiers for any host-pathogen combination for which there are sufficient classified training sequences.
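To illustrate the kind of input a codon-usage classifier works from, the following minimal sketch turns a nucleotide sequence into a vector of relative codon frequencies. It is an assumption-laden illustration, not Eclat's implementation: the abstract does not specify Eclat's feature set, and the function name `codon_frequencies`, the fixed reading frame, and the example sequence are all hypothetical. Real ESTs have an unknown reading frame and strand, which this sketch does not address.

```python
from itertools import product

# All 64 codons in a fixed order, so every sequence maps to a vector
# with the same layout (hypothetical feature layout, not Eclat's).
CODONS = ["".join(p) for p in product("ACGT", repeat=3)]


def codon_frequencies(seq, frame=0):
    """Return relative codon frequencies (a list of 64 floats) for one reading frame."""
    seq = seq.upper().replace("U", "T")
    counts = dict.fromkeys(CODONS, 0)
    total = 0
    for i in range(frame, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if codon in counts:  # skip codons containing N or other ambiguity codes
            counts[codon] += 1
            total += 1
    if total == 0:
        # No unambiguous codon found: return an all-zero vector.
        return [0.0] * len(CODONS)
    return [counts[c] / total for c in CODONS]


# Made-up example sequence: ATG GCC GCC AAG GCC NNN TGA
features = codon_frequencies("ATGGCCGCCAAGGCCNNNTGA")
```

Vectors like `features`, computed for sequences of known origin, could then serve as training input for an SVM implementation; which features and kernel Eclat actually uses is not stated here.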
Eclat is freely available on the Internet (http://mips.gsf.de/proj/est) or on request as a standalone version.
friedel@informatik.uni-muenchen.de