Department of Information and Communications, Gwangju Institute of Science and Technology, Gwangju, Republic of Korea.
BMC Bioinformatics. 2009 Dec 31;10:455. doi: 10.1186/1471-2105-10-455.
Protein function prediction is one of the most important problems in functional genomics. As diverse genomic data sets have become available, many researchers have developed integration models that combine all available genomic data for protein function prediction. These efforts have improved prediction quality and extended prediction coverage. However, integrating more data sources does not always increase prediction quality. Selecting the data sources that contribute most to protein function prediction has therefore become an important issue.
We present systematic feature selection methods that assess the contribution of genome-wide data sets to protein function prediction, and we then investigate the relationship between genomic data sources and protein functions. In this study, we use ten genomic data sources for Mus musculus, including protein domains, protein-protein interactions, gene expression, phenotype ontology, phylogenetic profiles and disease data, to predict protein functions labelled with Gene Ontology (GO) terms. We apply two approaches to feature selection: exhaustive search feature selection using kernel-based logistic regression (KLR), and kernel-based L1-norm regularized logistic regression (KL1LR). In the first approach, we exhaustively measure the contribution of each data set to each function based on its prediction quality. In the second approach, we use the estimated feature coefficients as measures of the contribution of each data source. Our results show that the proposed methods improve prediction quality compared with full integration of all data sources and with other filter-based feature selection methods. We also show that the contributing data sources can differ depending on the protein function. Furthermore, we observe that the highly contributing data sets can be similar among protein functions that share the same parent in the GO hierarchy.
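The second approach, reading data-source contributions from L1-regularized coefficients, can be illustrated with a minimal sketch. It rests on several assumptions that are not taken from the paper: each data source is reduced to a single per-protein score rather than a full kernel, the model is fitted by plain proximal gradient descent (ISTA), and the toy data, the function names (`fit_l1_logistic`, `soft_threshold`) and the labels `source_A` to `source_C` are hypothetical. It shows only the general mechanism by which an L1 penalty drives the coefficients of uninformative sources to zero.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def soft_threshold(value, threshold):
    """Proximal operator of the L1 penalty: shrink toward zero, clip at zero."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def fit_l1_logistic(rows, labels, penalty, step=0.5, iterations=2000):
    """Minimise mean logistic loss + penalty * sum(|w_j|) by proximal gradient.

    The bias term is not penalised. Returns (weights, bias).
    """
    n, d = len(rows), len(rows[0])
    weights, bias = [0.0] * d, 0.0
    for _ in range(iterations):
        grad_w, grad_b = [0.0] * d, 0.0
        for x, y in zip(rows, labels):
            err = sigmoid(bias + sum(w * v for w, v in zip(weights, x))) - y
            grad_b += err / n
            for j in range(d):
                grad_w[j] += err * x[j] / n
        bias -= step * grad_b
        # Gradient step on the smooth loss, then shrinkage for the L1 term.
        weights = [soft_threshold(w - step * g, step * penalty)
                   for w, g in zip(weights, grad_w)]
    return weights, bias


# Hypothetical toy data: one score per data source for eight proteins.
# source_A tracks the label; source_B and source_C are balanced within
# each class and therefore carry no information about it.
source_names = ["source_A", "source_B", "source_C"]
rows = [
    [1, 1, 1], [1, -1, 1], [1, 1, -1], [1, -1, -1],      # annotated (label 1)
    [-1, 1, 1], [-1, -1, 1], [-1, 1, -1], [-1, -1, -1],  # not annotated (label 0)
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

weights, bias = fit_l1_logistic(rows, labels, penalty=0.1)
selected = [name for name, w in zip(source_names, weights) if w != 0.0]
print(dict(zip(source_names, weights)))
print("selected sources:", selected)
```

In this toy setting only `source_A` keeps a nonzero coefficient, so the magnitude of each coefficient serves as a rough per-function measure of how much a source contributes. A kernel-based variant would replace the single scores with kernel-derived features per data source, and a larger `penalty` would remove more sources.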
In contrast to previous integration methods, our approaches not only increase prediction quality but also identify the data sources that contribute most to each protein function. This information can help researchers collect relevant data sources for annotating protein functions.