Eskin E, Grundy W N, Singer Y
Department of Computer Science, Columbia University, USA.
Proc Int Conf Intell Syst Mol Biol. 2000;8:134-45.
In this paper, we present a method for classifying proteins into families using sparse Markov transducers (SMTs). Sparse Markov transducers, like probabilistic suffix trees, estimate a probability distribution conditioned on an input sequence. SMTs generalize probabilistic suffix trees by allowing wildcards in the conditioning sequences. Because amino acid substitutions are common in protein families, incorporating wildcards into the model significantly improves classification performance. We present two models for building protein family classifiers using SMTs. We also present efficient data structures that reduce the memory usage of the models. We evaluate SMTs by building protein family classifiers using the Pfam database and compare our results to previously published results.
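The idea of conditioning on a context that contains wildcards can be illustrated with a small sketch. This is not the SMT algorithm from the paper; it is a toy fixed-length context model, assuming a single hand-chosen wildcard pattern and add-one smoothing over a 20-letter amino acid alphabet. The names `mask_context`, `train`, and `predict` are hypothetical and chosen only for illustration.

```python
from collections import Counter, defaultdict

WILDCARD = "*"  # assumed placeholder symbol for "any amino acid"


def mask_context(context, wildcard_positions):
    """Replace the characters at the given positions with the wildcard."""
    return "".join(
        WILDCARD if i in wildcard_positions else ch
        for i, ch in enumerate(context)
    )


def train(sequences, context_len, wildcard_positions=frozenset()):
    """Count next-symbol occurrences for each (masked) context."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for i in range(context_len, len(seq)):
            key = mask_context(seq[i - context_len:i], wildcard_positions)
            counts[key][seq[i]] += 1
    return counts


def predict(counts, context, symbol, wildcard_positions=frozenset(),
            alphabet_size=20, pseudocount=1.0):
    """Smoothed estimate of P(symbol | masked context)."""
    key = mask_context(context, wildcard_positions)
    seen = counts.get(key, Counter())
    total = sum(seen.values())
    return (seen[symbol] + pseudocount) / (total + pseudocount * alphabet_size)


# Two training fragments that differ only by a substitution in the middle.
family = ["ADCW", "AECW"]

exact = train(family, context_len=3)
sparse = train(family, context_len=3, wildcard_positions={1})

# An unseen variant "AGC": the exact model has no counts for this context,
# while the wildcard model pools the counts under the key "A*C".
p_exact = predict(exact, "AGC", "W")
p_sparse = predict(sparse, "AGC", "W", wildcard_positions={1})
```

In this toy setting the wildcard model assigns a higher probability to `W` after the unseen context `AGC` than the exact-context model does, which mirrors the abstract's argument that wildcards help when substitutions are common. The paper's actual models, and how they choose wildcard placements, are not specified in this abstract.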