Roche-Lima Abiel, Domaratzki Michael, Fristensky Brian
Department of Computer Science, University of Manitoba, Winnipeg, Manitoba, Canada.
BMC Bioinformatics. 2014 Sep 26;15(1):318. doi: 10.1186/1471-2105-15-318.
Metabolic networks are represented by sets of metabolic pathways. A metabolic pathway is a series of biochemical reactions in which the product (output) of one reaction serves as the substrate (input) of another. Many pathways remain incompletely characterized, and obtaining better models of metabolic pathways is one of the major challenges of computational biology. Existing models depend on gene annotation, so errors accumulate when pathways are predicted from incorrectly annotated genes. Pairwise classification methods are supervised learning methods used to classify new pairs of entities. Some of these methods, e.g., Pairwise Support Vector Machines (SVMs), use pairwise kernels, which describe similarity measures between two pairs of entities. Using pairwise kernels to handle sequence data requires long processing times and large amounts of storage. Rational kernels are kernels based on weighted finite-state transducers that represent similarity measures between sequences or automata. They have been used effectively in problems that handle large amounts of sequence information, such as protein essentiality, natural language processing and machine translation.
We created a new family of pairwise kernels based on weighted finite-state transducers, called Pairwise Rational Kernels (PRKs), to predict metabolic pathways from a variety of biological data. PRKs take advantage of the simpler representations and faster algorithms of transducers. Because raw sequence data can be used, the predictor model avoids the errors introduced by incorrect gene annotations. We then ran several experiments with PRKs and Pairwise SVMs to validate our methods using the metabolic network of Saccharomyces cerevisiae. With PRKs, our method executed faster than with other pairwise kernels. Moreover, when PRKs were combined with other simple kernels that include evolutionary information, accuracy improved while lower construction and execution times were maintained.
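As a rough illustration of how a pairwise kernel can be built on top of a sequence kernel, the sketch below uses an n-gram kernel (a standard example of a kernel that can be expressed as a rational kernel) and lifts it to pairs with a tensor-product construction. This is a minimal sketch under stated assumptions, not the authors' implementation: the n-gram counts are computed directly rather than with weighted finite-state transducers, the tensor-product form is one common choice of pairwise kernel and may differ from the PRK variants evaluated in the paper, and the function names are hypothetical.

```python
from collections import Counter


def ngram_counts(seq, n=3):
    """Count every contiguous n-gram in a sequence (empty if len(seq) < n)."""
    return Counter(seq[i:i + n] for i in range(len(seq) - n + 1))


def ngram_kernel(x, y, n=3):
    """n-gram kernel: sum over shared n-grams of count_in_x * count_in_y.

    Assumption: computed by direct counting here; a transducer-based
    rational kernel would yield the same value through composition.
    """
    cx, cy = ngram_counts(x, n), ngram_counts(y, n)
    return sum(cx[g] * cy[g] for g in cx.keys() & cy.keys())


def pairwise_kernel(pair1, pair2, n=3):
    """Tensor-product pairwise kernel between two pairs of sequences.

    K((a, b), (c, d)) = k(a, c) * k(b, d) + k(a, d) * k(b, c),
    which does not depend on the order of the elements inside a pair.
    """
    (a, b), (c, d) = pair1, pair2
    return (ngram_kernel(a, c, n) * ngram_kernel(b, d, n)
            + ngram_kernel(a, d, n) * ngram_kernel(b, c, n))
```

A Gram matrix filled with pairwise_kernel values over all training pairs could then be passed to any SVM implementation that accepts a precomputed kernel. Note that the number of pairs grows quadratically with the number of sequences, which is why the cost of the underlying sequence kernel matters.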
The power of kernels is that almost any sort of data can be represented with them. Completely disparate types of data can therefore be combined to add power to kernel-based machine learning methods. When we compared our PRK-based approach with other similar kernels, execution times decreased with no loss of accuracy. We also showed that combining PRKs with other kernels that include evolutionary information can further improve accuracy. Because our approach can use any type of sequence data, genes do not need to be properly annotated, which avoids the accumulation of errors caused by incorrect earlier annotations.
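Combining disparate data types works because a non-negatively weighted sum of valid kernels is itself a valid kernel. The sketch below illustrates the idea with two toy kernels, one over sequences and one over binary presence/absence profiles standing in for evolutionary information. All names, weights and data are hypothetical and chosen only for illustration; the abstract does not specify which evolutionary kernels or combination scheme were used.

```python
def shared_kmers(s, t, k=2):
    """Toy sequence kernel: number of distinct k-mers the two sequences share."""
    ks = {s[i:i + k] for i in range(len(s) - k + 1)}
    kt = {t[i:i + k] for i in range(len(t) - k + 1)}
    return len(ks & kt)


def dot(u, v):
    """Linear kernel on numeric vectors (here: hypothetical binary profiles)."""
    return sum(ui * vi for ui, vi in zip(u, v))


def combine_kernels(kernels, weights):
    """Return the weighted sum of base kernels, one per data view.

    Each item is a tuple of views; kernels[i] is applied to view i.
    Weights must be non-negative so the sum remains a valid kernel.
    """
    if len(kernels) != len(weights) or any(w < 0 for w in weights):
        raise ValueError("need one non-negative weight per kernel")

    def combined(x, y):
        return sum(w * k(x[i], y[i])
                   for i, (k, w) in enumerate(zip(kernels, weights)))

    return combined


# Hypothetical genes: (raw sequence, presence/absence profile across 4 species)
gene1 = ("MKVLA", (1, 0, 1, 1))
gene2 = ("KVLAT", (1, 1, 1, 0))
kernel = combine_kernels([shared_kmers, dot], weights=[1.0, 0.5])
similarity = kernel(gene1, gene2)
```

Here the two genes share three 2-mers and two profile positions, so the combined similarity is 1.0 * 3 + 0.5 * 2. In practice the weights would be tuned, for example by cross-validation.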