Kontos Kevin, Godard Patrice, André Bruno, van Helden Jacques, Bontempi Gianluca
Machine Learning Group, Département d'Informatique, Faculté des Sciences, Université Libre de Bruxelles (ULB), Boulevard du Triomphe CP 212, 1050 Brussels, Belgium.
BMC Proc. 2008 Dec 17;2 Suppl 4(Suppl 4):S5. doi: 10.1186/1753-6561-2-s4-s5.
Nitrogen is an essential nutrient for all life forms. Like most unicellular organisms, the yeast Saccharomyces cerevisiae transports and catabolizes good nitrogen sources in preference to poor ones. This selection mechanism is called nitrogen catabolite repression (NCR). All known nitrogen catabolite pathways are controlled by four regulators. The ultimate goal is to infer these pathways in full. Bioinformatics approaches make it possible to identify putative NCR genes and to set aside genes that are unlikely to be involved.
We present a machine learning approach where the identification of putative NCR genes in the yeast Saccharomyces cerevisiae is formulated as a supervised two-class classification problem. Classifiers predict whether genes are NCR-sensitive or not from a large number of variables related to the GATA motif in the upstream non-coding sequences of the genes. The positive and negative training sets are composed of annotated NCR genes and manually-selected genes known to be insensitive to NCR, respectively. Different classifiers and variable selection methods are compared. We show that all classifiers make significant and biologically valid predictions by comparing these predictions to annotated and putative NCR genes, and by performing several negative controls. In particular, the inferred NCR genes significantly overlap with putative NCR genes identified in three genome-wide experimental and bioinformatics studies.
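As a rough illustration of what "variables related to the GATA motif in the upstream non-coding sequences" could look like, the sketch below derives a few simple counts from one upstream sequence. It is not the authors' feature set: the function name `gata_features`, the choice of the bare 4-mer GATA (and its reverse complement TATC), and the specific variables returned are all assumptions made for the example.

```python
import re

# ASSUMPTION: the bare core motif is used for illustration only; the study's
# variables may rely on longer or degenerate GATA variants.
FORWARD_MOTIF = "GATA"
REVERSE_MOTIF = "TATC"  # reverse complement of GATA


def gata_features(upstream):
    """Return simple GATA-related variables for one upstream sequence.

    Hypothetical helper: the returned keys are illustrative names,
    not variables defined in the paper.
    """
    seq = upstream.upper()
    # Zero-width lookahead yields the start position of every occurrence.
    forward = [m.start() for m in re.finditer("(?=%s)" % FORWARD_MOTIF, seq)]
    reverse = [m.start() for m in re.finditer("(?=%s)" % REVERSE_MOTIF, seq)]
    positions = sorted(forward + reverse)
    # Smallest distance between the starts of two neighbouring sites,
    # or None when fewer than two sites are present.
    gaps = [b - a for a, b in zip(positions, positions[1:])]
    return {
        "n_forward": len(forward),
        "n_reverse": len(reverse),
        "n_total": len(positions),
        "min_spacing": min(gaps) if gaps else None,
    }


if __name__ == "__main__":
    # Toy sequence, not a real yeast promoter.
    print(gata_features("ttGATAaccTATCggGATAag"))
```

A vector of such variables per gene, paired with a label (annotated NCR gene or known NCR-insensitive gene), is the kind of input a two-class classifier would be trained on.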
These results suggest that our approach can successfully identify potential NCR genes. Hence, the dimensionality of the problem of identifying all genes involved in NCR is drastically reduced.