Zhao Xing-Ming, Wang Yong, Chen Luonan, Aihara Kazuyuki
ERATO Aihara Complexity Modelling Project, JST, 4-6-1 Komaba, Meguro, Tokyo, Japan.
BMC Bioinformatics. 2008 Jan 28;9:57. doi: 10.1186/1471-2105-9-57.
In general, gene function prediction can be formalized as a classification problem and solved with machine learning techniques. Training a classifier usually requires both labeled positive and labeled negative samples. For gene function prediction, however, the available information covers only positive samples: we know which genes have the function of interest, but it is generally unclear which genes do not have it, i.e. which are the negative samples. If all genes outside the target functional family are treated as negative samples, a class imbalance problem arises, because only a relatively small number of genes are annotated in each family. Furthermore, false negatives among the heuristically generated negative samples may degrade the classifier.
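The imbalance argument can be made concrete with a small calculation. The sketch below uses hypothetical counts (6000 genes, 50 annotated to the target family; neither number comes from the paper) and shows that a classifier which labels every gene as negative still reaches high accuracy while recovering no positives.

```python
def evaluate(labels, predictions):
    """Return (accuracy, recall) for binary labels where 1 = has the function."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)
    tn = sum(1 for y, p in pairs if y == 0 and p == 0)
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)
    accuracy = (tp + tn) / len(labels)
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, recall


# Hypothetical counts, chosen only for illustration.
n_genes = 6000
n_positive = 50

# Treat every gene outside the family as a negative sample.
labels = [1] * n_positive + [0] * (n_genes - n_positive)

# A degenerate classifier that never predicts the function.
always_negative = [0] * n_genes

accuracy, recall = evaluate(labels, always_negative)
print(f"accuracy={accuracy:.3f} recall={recall:.3f}")
```

Under these assumed counts the accuracy is about 0.992 and the recall is 0, which is why accuracy alone is a poor guide when negatives vastly outnumber positives.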
In this paper, we present a new technique, Annotating Genes with Positive Samples (AGPS), for defining negative samples in gene function prediction. Once the negative samples are defined, predicting the functions of unknown genes is straightforward. In addition, the AGPS algorithm can integrate various kinds of data sources to predict gene functions reliably and accurately. With one-class and two-class Support Vector Machines as the core learning algorithms, AGPS performs well for function prediction on yeast genes.
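The abstract does not spell out the AGPS procedure, so the following is only a minimal sketch of the general idea of defining negatives from positive and unlabeled genes, not the authors' algorithm. It deliberately swaps in simpler stand-ins: a distance-to-positive-centroid score in place of the one-class SVM, and a nearest-centroid rule in place of the two-class SVM. The function names (`select_negatives`, `predict`), the two-dimensional feature vectors, and the choice of three negatives are all hypothetical.

```python
import math


def centroid(points):
    """Component-wise mean of a list of equal-length feature vectors."""
    dim = len(points[0])
    return [sum(p[i] for p in points) / len(points) for i in range(dim)]


def distance(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def select_negatives(positives, unlabeled, n_negatives):
    """Stand-in for the one-class step: keep the unlabeled points
    that lie farthest from the centroid of the positive samples."""
    center = centroid(positives)
    ranked = sorted(unlabeled, key=lambda p: distance(p, center), reverse=True)
    return ranked[:n_negatives]


def predict(point, positives, negatives):
    """Stand-in for the two-class step: assign the class whose
    centroid is nearer (1 = has the function, 0 = does not)."""
    d_pos = distance(point, centroid(positives))
    d_neg = distance(point, centroid(negatives))
    return 1 if d_pos < d_neg else 0


# Toy feature vectors, invented for illustration only.
positives = [[1.0, 1.0], [1.2, 0.9], [0.9, 1.1]]
unlabeled = [[1.1, 1.0], [5.0, 5.0], [4.8, 5.2], [5.1, 4.9], [1.0, 0.8]]

negatives = select_negatives(positives, unlabeled, n_negatives=3)
predictions = [predict(p, positives, negatives) for p in unlabeled]
print(negatives)
print(predictions)
```

In this toy data the three distant points are chosen as negatives, and the two unlabeled points near the positive cluster are predicted to have the function. A real implementation would replace both stand-ins with actual one-class and two-class SVMs and choose the number of negatives from the data.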
We propose a new method for defining negative samples in gene function prediction. Experimental results on yeast genes show that AGPS performs well on both training and test sets. In addition, the overlap between the prediction results and GO annotations on unknown genes also demonstrates the effectiveness of the proposed method.