Department of Biomedical Informatics, University of Pittsburgh, Pittsburgh, PA, USA. mop13+
BMC Bioinformatics. 2010 Jan 18;11 Suppl 1(Suppl 1):S57. doi: 10.1186/1471-2105-11-S1-S57.
Biological processes in cells are carried out through protein-protein interactions. Determining by wet-lab experiment whether a pair of proteins interacts is resource-intensive; only about 38,000 interactions are known today, out of the few hundred thousand expected. Active machine learning can guide the selection of protein pairs for future experimental characterization and thereby accelerate accurate prediction of the human protein interactome.
Random forest (RF) has previously been shown to be effective for predicting protein-protein interactions. Here, four different active learning algorithms were devised to select the protein pairs used to train the RF. With labels for as few as 500 protein pairs selected by any of the four active learning methods described here, the classifier achieved a higher F-score (the harmonic mean of precision and recall) than with 3000 randomly chosen protein pairs. The F-score of predicted interactions increased by about 15% with active learning compared with random selection of data.
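Since the results are reported as F-scores, a small worked example may help fix the definition. The sketch below computes the harmonic mean of precision and recall from raw counts; the function name `f_score` and the counts are hypothetical illustrations, not figures from the study.

```python
def f_score(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall, computed from raw counts.

    tp: predicted interactions that are true interactions
    fp: predicted interactions that are not true interactions
    fn: true interactions that the classifier missed
    """
    if tp == 0:
        # With no true positives, precision and recall are both zero
        # (or undefined), so the F-score is reported as 0.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts, chosen only for illustration:
# precision = 60 / 80 = 0.75, recall = 60 / 100 = 0.60
example = f_score(tp=60, fp=20, fn=40)
print(round(example, 3))
```

Because the harmonic mean is pulled toward the smaller of the two values, a classifier cannot reach a high F-score by trading away recall for precision or the reverse.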
Active learning algorithms make it possible to learn more accurate classifiers from far less labelled data, and are useful in applications where manual annotation of data is a formidable task. The active learning techniques demonstrated here can also be applied to other proteomics problems, such as protein structure prediction and classification.
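The selection step that active learning relies on can be illustrated with uncertainty sampling, a common generic strategy in which the pairs the current classifier is least sure about are sent for labelling next. This is an assumption made for illustration: the abstract does not say which four strategies were devised, so the sketch should not be read as any of them. The function `select_most_uncertain`, the protein identifiers, and the scores are all hypothetical; with a random forest, the score for a pair could be the fraction of trees voting "interacts".

```python
from typing import Dict, List, Tuple

Pair = Tuple[str, str]


def select_most_uncertain(scores: Dict[Pair, float], k: int) -> List[Pair]:
    """Return the k unlabelled pairs whose predicted interaction
    probability lies closest to 0.5, i.e. where the classifier is
    least certain. Ties keep their original insertion order."""
    ranked = sorted(scores, key=lambda pair: abs(scores[pair] - 0.5))
    return ranked[:k]


# Hypothetical classifier outputs for four unlabelled protein pairs.
pool_scores = {
    ("P1", "P2"): 0.95,  # confidently predicted to interact
    ("P1", "P3"): 0.52,  # nearly undecided
    ("P2", "P4"): 0.10,  # confidently predicted not to interact
    ("P3", "P4"): 0.45,  # fairly undecided
}

# Pick the two pairs to send for experimental labelling next.
queries = select_most_uncertain(pool_scores, k=2)
print(queries)
```

In a full loop, the selected pairs would be labelled experimentally, added to the training set, and the classifier retrained before the next round of selection.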