Department of Computer Science, Katholieke Universiteit Leuven, Celestijnenlaan 200A, 3001 Leuven, Belgium.
BMC Bioinformatics. 2010 Jan 2;11:2. doi: 10.1186/1471-2105-11-2.
S. cerevisiae, A. thaliana and M. musculus are well-studied organisms in biology, and the sequencing of their genomes was completed many years ago. It remains a challenge, however, to develop methods that automatically assign biological functions to the open reading frames (ORFs) in these genomes. Different machine learning methods have been proposed to this end, but it remains unclear which method is to be preferred in terms of predictive performance, efficiency and usability.
We study the use of decision-tree-based models for predicting the multiple functions of ORFs. First, we describe an algorithm for learning hierarchical multi-label decision trees. These trees can simultaneously predict all the functions of an ORF while respecting a given hierarchy of gene functions (such as FunCat or GO). We present new results obtained with this algorithm, showing that the trees it finds exhibit clearly better predictive performance than the trees found by previously described methods. Nevertheless, the predictive performance of individual trees is lower than that of some recently proposed statistical learning methods. We show that ensembles of such trees are more accurate than single trees and are competitive with state-of-the-art statistical learning and functional linkage methods. Moreover, the ensemble method is computationally efficient and easy to use.
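As a rough illustration of what "respecting a given hierarchy" can mean in practice, the sketch below encodes each ORF's functions as a 0/1 vector that always includes the ancestors of every annotated class, lets a tree leaf predict the mean of such vectors, and averages the per-tree predictions in an ensemble. This is a minimal sketch under stated assumptions, not the algorithm as described in the paper: the toy hierarchy, the class codes, the threshold, and all function names (`with_ancestors`, `leaf_prediction`, `ensemble_prediction`, `predict_classes`) are hypothetical, and the choice of mean vectors and plain averaging is an assumption made for illustration.

```python
from typing import Dict, Iterable, List, Optional, Set

# Hypothetical toy hierarchy in FunCat-like notation: child -> parent
# (None marks a top-level class). Not taken from the paper.
PARENT: Dict[str, Optional[str]] = {
    "01": None,
    "01.01": "01",
    "01.01.03": "01.01",
    "02": None,
}
CLASSES: List[str] = sorted(PARENT)  # fixed column order for label vectors


def with_ancestors(labels: Iterable[str]) -> Set[str]:
    """Close a label set under the hierarchy: a class implies all its ancestors."""
    closed: Set[str] = set()
    for label in labels:
        node: Optional[str] = label
        while node is not None:
            closed.add(node)
            node = PARENT[node]
    return closed


def to_vector(labels: Iterable[str]) -> List[float]:
    """Encode a label set as a 0/1 vector over CLASSES, ancestors included."""
    closed = with_ancestors(labels)
    return [1.0 if c in closed else 0.0 for c in CLASSES]


def leaf_prediction(label_sets: List[Set[str]]) -> List[float]:
    """Assumed leaf rule: the mean label vector of the examples in the leaf."""
    vectors = [to_vector(s) for s in label_sets]
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def ensemble_prediction(per_tree: List[List[float]]) -> List[float]:
    """Assumed ensemble rule: average the per-class scores over all trees."""
    return [sum(col) / len(per_tree) for col in zip(*per_tree)]


def predict_classes(scores: List[float], threshold: float) -> Set[str]:
    """Predict every class whose score reaches the threshold."""
    return {c for c, s in zip(CLASSES, scores) if s >= threshold}


# Two hypothetical trees, each routing the query ORF to a different leaf.
tree_a = leaf_prediction([{"01.01.03"}, {"01.01"}])
tree_b = leaf_prediction([{"01"}, {"02"}])
scores = ensemble_prediction([tree_a, tree_b])
predicted = predict_classes(scores, threshold=0.5)
```

Because every training vector already contains the ancestors of each annotated class, a parent's score can never fall below its child's score, and averaging over trees keeps that ordering. Applying a single threshold to all classes therefore yields a prediction set that is itself consistent with the hierarchy.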
Our results suggest that decision-tree-based methods are a state-of-the-art, efficient and easy-to-use approach to ORF function prediction.