Clare A, King R D
Department of Computer Science, University of Wales, Aberystwyth, Penglais, Aberystwyth, Wales, UK.
Bioinformatics. 2003 Oct;19 Suppl 2:ii42-9. doi: 10.1093/bioinformatics/btg1058.
S.cerevisiae is one of the most important model organisms and has been the focus of over a century of study. In spite of these efforts, 40% of its open reading frames (ORFs) remain classified as having unknown function (MIPS: Munich Information Center for Protein Sequences). We wished to predict the function of these ORFs using data mining, as we have previously done successfully for the genomes of M.tuberculosis and E.coli. Applying this approach to the larger, eukaryotic S.cerevisiae genome requires modifying the machine learning and data mining algorithms, because more data are available for this organism and its functional classification is more challenging.
Novel extensions to the machine learning and data mining algorithms have been devised in order to deal with the challenges. Accurate rules have been learned and predictions have been made for many of the ORFs whose function is currently unknown. The rules are informative, agree with known biology and allow for scientific discovery.
All predictions are freely available from http://www.genepredictions.org, all datasets used in this study are freely available from http://www.aber.ac.uk/compsci/Research/bio/dss/yeastdata and software for relational data mining is available from http://www.aber.ac.uk/compsci/Research/bio/dss/polyfarm.