Computational Bioscience Research Center (CBRC), King Abdullah University of Science and Technology, Thuwal, 23955-6900, Saudi Arabia.
Computer, Electrical and Mathematical Sciences & Engineering Division (CEMSE), King Abdullah University of Science and Technology, Thuwal, 23955-6900, Saudi Arabia.
Bioinformatics. 2024 Jun 28;40(Suppl 1):i401-i409. doi: 10.1093/bioinformatics/btae237.
Automated protein function prediction is a crucial and widely studied problem in bioinformatics. Computationally, protein function prediction is a multilabel classification problem in which only positive samples are defined and a large number of annotations remain unlabeled. Most existing methods assume that the unlabeled protein function annotations are negatives, which induces the false negative issue: potential positive samples are trained as negatives. We introduce a novel approach named PU-GO, in which we address function prediction as a positive-unlabeled ranking problem. We apply empirical risk minimization, i.e. we minimize the classification risk of a classifier, with class priors obtained from the Gene Ontology hierarchical structure. We show that our approach is more robust than other state-of-the-art methods on similarity-based and time-based benchmark datasets.
Data and code are available at https://github.com/bio-ontology-research-group/PU-GO.
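To make the positive-unlabeled idea in the abstract concrete, the following is a minimal sketch of a generic non-negative PU risk estimator for a single function class. It is not the PU-GO loss itself, which is a ranking formulation available in the repository above. The function name `nn_pu_risk`, the sigmoid surrogate loss, and the `prior` argument are assumptions made for illustration; in PU-GO the class priors come from the Gene Ontology hierarchy, whereas here `prior` is simply passed in by the caller.

```python
import numpy as np


def sigmoid(x):
    """Logistic function, used here as a bounded surrogate loss."""
    return 1.0 / (1.0 + np.exp(-x))


def nn_pu_risk(scores, is_positive, prior):
    """Non-negative PU risk for one class (illustrative sketch only).

    scores      : classifier outputs g(x), one per protein
    is_positive : True where the protein is annotated with the class,
                  False where the annotation is unlabeled
    prior       : assumed fraction of true positives for this class
                  (hypothetical input; not derived from GO here)
    """
    scores = np.asarray(scores, dtype=float)
    is_positive = np.asarray(is_positive, dtype=bool)
    pos = scores[is_positive]
    unl = scores[~is_positive]

    # Loss of labeled positives when treated as positives.
    risk_pos_as_pos = sigmoid(-pos).mean() if pos.size else 0.0
    # Loss of labeled positives when treated as negatives.
    risk_pos_as_neg = sigmoid(pos).mean() if pos.size else 0.0
    # Loss of unlabeled samples when treated as negatives.
    risk_unl_as_neg = sigmoid(unl).mean() if unl.size else 0.0

    # The unlabeled set mixes positives and negatives, so the positive
    # share is subtracted; clamping at zero keeps the estimate valid.
    neg_risk = max(0.0, risk_unl_as_neg - prior * risk_pos_as_neg)
    return prior * risk_pos_as_pos + neg_risk


# One annotated protein and three unlabeled ones, all scored 0.
example = nn_pu_risk([0.0, 0.0, 0.0, 0.0], [True, False, False, False], prior=0.2)
```

The subtraction of `prior * risk_pos_as_neg` is what distinguishes this from the common assumption that every unlabeled annotation is a negative: the unlabeled set is treated as a mixture, so part of its apparent negative loss is attributed to hidden positives.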