Bioinformatics Program, Department of Bioengineering, University of Illinois at Chicago, Chicago, IL 60607, USA.
BMC Bioinformatics. 2010 Jan 18;11 Suppl 1(Suppl 1):S6. doi: 10.1186/1471-2105-11-S1-S6.
In supervised learning, traditional approaches to building a classifier use two sets of examples with pre-defined classes along with a learning algorithm. The main limitation of this approach is that examples from both classes are required, which may be infeasible in certain cases, especially those dealing with biological data. Such is the case for membrane-binding peripheral domains, which play important roles in many biological processes, including cell signaling and membrane trafficking, by reversibly binding to membranes. For these domains, a well-defined positive set of domains known to bind membranes is available, along with a large unlabeled set of domains whose membrane-binding affinities have not been measured. The aforementioned limitation can be addressed by a special class of semi-supervised machine learning called positive-unlabeled (PU) learning, which uses a positive set together with a large unlabeled set.

In this study, we implement the first application of PU-learning to a protein function prediction problem: identification of peripheral domains. PU-learning starts by iteratively identifying reliable negative (RN) examples from the unlabeled set until convergence, and then builds a classifier using the positive set and the final RN set. A data set of 232 positive cases and ~3750 unlabeled ones was used to construct and validate the protocol.
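The iterative RN-identification step described above can be illustrated with a minimal Python sketch. It is not the implementation used in this study: the nearest-centroid classifier is a stand-in chosen only to keep the example dependency-free, and the names `pu_learn`, `predict_positive`, and `max_iter` are hypothetical. The sketch assumes each domain is already encoded as a numeric feature vector.

```python
def centroid(rows):
    """Component-wise mean of a non-empty list of feature vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def predict_positive(x, pos_c, neg_c):
    """Stand-in classifier: positive if x is at least as close to the positive centroid."""
    return sq_dist(x, pos_c) <= sq_dist(x, neg_c)


def pu_learn(positives, unlabeled, max_iter=50):
    """Shrink the unlabeled set to a reliable-negative (RN) set.

    Assumption: the unlabeled set is initially treated as all negative, and
    any example the current classifier calls positive is dropped from RN.
    Iteration stops when RN no longer changes (convergence), when RN would
    become empty, or after max_iter rounds.
    """
    pos_c = centroid(positives)
    rn = list(unlabeled)
    for _ in range(max_iter):
        neg_c = centroid(rn)
        kept = [x for x in rn if not predict_positive(x, pos_c, neg_c)]
        if not kept or len(kept) == len(rn):
            break
        rn = kept
    # Final classifier: positive centroid versus centroid of the final RN set.
    return pos_c, centroid(rn), rn
```

A call such as `pos_c, neg_c, rn = pu_learn(positives, unlabeled)` returns the two centroids that define the final classifier together with the converged RN set; a new domain `x` is then scored with `predict_positive(x, pos_c, neg_c)`.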
Holdout evaluation of the protocol on a left-out positive set showed that prediction accuracy reached up to 95% in two independent implementations.
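A holdout evaluation of this kind can be sketched as follows. This is an illustration under stated assumptions, not the evaluation code used in this study: accuracy is taken here to mean the fraction of left-out positives that the trained classifier labels as positive, the 20% holdout fraction is arbitrary, and `holdout_positive_accuracy`, `toy_train`, and `toy_predict` are hypothetical names. The toy threshold classifier on one-dimensional features exists only to make the example runnable.

```python
import random


def holdout_positive_accuracy(positives, unlabeled, train, predict,
                              holdout_frac=0.2, seed=0):
    """Fraction of held-out positives that the trained model calls positive.

    `train(train_positives, unlabeled)` returns a model and
    `predict(model, x)` returns True for a positive call; both are
    supplied by the caller.
    """
    if len(positives) < 2:
        raise ValueError("need at least two positives to hold one out")
    rng = random.Random(seed)
    shuffled = list(positives)
    rng.shuffle(shuffled)
    # Hold out at least one positive, but always leave one for training.
    n_held = max(1, int(len(shuffled) * holdout_frac))
    n_held = min(n_held, len(shuffled) - 1)
    held_out, train_pos = shuffled[:n_held], shuffled[n_held:]
    model = train(train_pos, unlabeled)
    hits = sum(1 for x in held_out if predict(model, x))
    return hits / len(held_out)


def toy_train(train_pos, unlabeled):
    """Toy model: threshold midway between the two sample means (1-D features)."""
    mean_pos = sum(train_pos) / len(train_pos)
    mean_unl = sum(unlabeled) / len(unlabeled)
    return (mean_pos + mean_unl) / 2


def toy_predict(threshold, x):
    """Toy decision rule: positive if the feature exceeds the threshold."""
    return x > threshold
```

Because the left-out examples are all positives, this measure reflects how many known binders are recovered; it says nothing about false positives among the unlabeled domains.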
These results suggest that our protocol can be used to predict the membrane-binding properties of a wide variety of modular domains. Protocols like the one presented here are particularly useful when information is available for only one class.