Deng Xutao, Geng Huimin, Ali Hesham
Dept of Comp. Sci., University of Nebraska at Omaha, Omaha, NE 68182, USA.
Proc IEEE Comput Syst Bioinform Conf. 2005:25-34. doi: 10.1109/csb.2005.38.
We developed a machine learning system for determining gene functions from heterogeneous data sources using a Weighted Naive Bayesian Network (WNB). Knowledge of gene functions is crucial for understanding many fundamental biological mechanisms, such as regulatory pathways, cell cycles and diseases. Our major goal is to accurately infer the functions of putative genes or ORFs (Open Reading Frames) from existing databases using computational methods. This task is intrinsically difficult because the underlying biological processes involve complex interactions among multiple entities. Consequently, many functional links are missed when only one or two data sources are used in the prediction. Our hypothesis is that integrating evidence from multiple, complementary sources could significantly improve prediction accuracy. Our experimental results not only suggest that this hypothesis is valid, but also provide guidelines for using the WNB system for data collection, training and prediction. The combined training data sets contain information from gene annotations, gene expression, clustering outputs, keyword annotations and sequence homology, all drawn from public databases. The current system is trained and tested on the genes of the budding yeast Saccharomyces cerevisiae. Through the weight training process, our WNB model can also be used to analyze how much each source of information contributes to prediction performance. This contribution analysis could potentially lead to significant scientific discovery by facilitating the interpretation and understanding of the complex relationships between biological entities.
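To make the idea of weighting evidence sources concrete, here is a minimal Python sketch of one common weighted naive Bayes formulation, in which each source's log-likelihood is multiplied by a per-source weight: score(c) = log P(c) + sum_i w_i log P(x_i | c). The abstract does not state the exact formulation used by the WNB system, so this scoring rule, the function names `train` and `predict`, the toy gene records, and the hand-set weights are all assumptions for illustration. In particular, the weight training process mentioned above is not implemented here.

```python
import math
from collections import Counter

def train(rows, labels, alpha=1.0):
    """Estimate class priors and per-source conditionals with Laplace smoothing."""
    classes = sorted(set(labels))
    class_count = Counter(labels)
    # Number of distinct values observed for each evidence source.
    n_values = [len({r[i] for r in rows}) for i in range(len(rows[0]))]
    counts = Counter((i, v, c) for r, c in zip(rows, labels) for i, v in enumerate(r))
    prior = {c: class_count[c] / len(labels) for c in classes}

    def cond(i, v, c):
        # Smoothed estimate of P(source i takes value v | class c).
        return (counts[(i, v, c)] + alpha) / (class_count[c] + alpha * n_values[i])

    return classes, prior, cond

def predict(model, weights, row):
    """Return the class maximizing log P(c) + sum_i w_i * log P(x_i | c)."""
    classes, prior, cond = model
    scores = {
        c: math.log(prior[c])
        + sum(w * math.log(cond(i, v, c)) for i, (w, v) in enumerate(zip(weights, row)))
        for c in classes
    }
    return max(scores, key=scores.get)

# Hypothetical toy records: (expression cluster, homology hit, keyword).
ROWS = [
    ("c1", "hit", "kw_a"), ("c1", "hit", "kw_b"), ("c1", "none", "kw_a"),
    ("c2", "none", "kw_b"), ("c2", "hit", "kw_b"), ("c2", "none", "kw_a"),
]
LABELS = ["metabolism"] * 3 + ["cell_cycle"] * 3

model = train(ROWS, LABELS)
query = ("c2", "hit", "kw_a")
print(predict(model, [1.0, 1.0, 1.0], query))  # all sources weighted equally
print(predict(model, [0.0, 1.0, 1.0], query))  # expression cluster ignored
```

In this toy data the two calls return different classes, which shows how a source's weight can change the prediction. A weight of zero removes a source entirely, so comparing weights gives a rough reading of each source's contribution, in the spirit of the contribution analysis described above.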