Kunik Vered, Solan Zach, Edelman Shimon, Ruppin Eytan, Horn David
School of Computer Science, Tel-Aviv University, Tel-Aviv 69978, Israel.
Proc IEEE Comput Syst Bioinform Conf. 2005:80-5. doi: 10.1109/csb.2005.39.
We present a novel unsupervised method for extracting meaningful motifs from biological sequence data. This de novo motif extraction (MEX) algorithm is data-driven and finds motifs that are not necessarily over-represented in the data. Applying MEX to the oxidoreductase class of enzymes, which contains approximately 7000 enzyme sequences, yields a relatively small set of motifs. This set spans a motif space that is used for functional classification of the enzymes by an SVM classifier. Classification based on MEX motifs surpasses that of two other SVM-based methods: SVMProt, which analyzes physicochemical properties of a protein derived from its amino acid sequence, and an SVM applied to a Smith-Waterman distance matrix. Our findings demonstrate that the MEX algorithm extracts relevant motifs that support successful sequence-to-function classification.
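As a rough illustration of what "spanning a motif space" could look like in practice, the sketch below maps each sequence to a vector with one coordinate per motif. The binary presence/absence encoding, the function names, and the toy motifs and sequences are all assumptions made for illustration; the abstract does not specify the encoding, and none of this is the MEX algorithm itself.

```python
from typing import Dict, List, Sequence


def motif_vector(sequence: str, motifs: Sequence[str]) -> List[int]:
    """Encode one sequence as a binary vector over the motif list.

    Assumption: a coordinate is 1 if the motif occurs anywhere in the
    sequence as an exact substring, and 0 otherwise.
    """
    return [1 if motif in sequence else 0 for motif in motifs]


def embed(sequences: Dict[str, str], motifs: Sequence[str]) -> Dict[str, List[int]]:
    """Map every named sequence to its motif-space vector."""
    return {name: motif_vector(seq, motifs) for name, seq in sequences.items()}


# Hypothetical motifs and sequences, invented purely for this example.
MOTIFS = ["GAGKT", "HPDK", "CPIC"]
SEQUENCES = {
    "seq_a": "MKVGAGKTLLHPDKAA",
    "seq_b": "MSTCPICWQRLLNE",
    "seq_c": "MAAALLKKRE",
}

if __name__ == "__main__":
    for name, vector in embed(SEQUENCES, MOTIFS).items():
        print(name, vector)
```

Vectors of this kind, paired with functional class labels, are the sort of input a standard SVM implementation would accept; the training step is omitted here because the abstract gives no kernel or parameter details.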