Department of Biomedical Informatics, University of Pittsburgh School of Medicine, Pittsburgh, PA, USA.
BMC Bioinformatics. 2010 Jan 18;11 Suppl 1(Suppl 1):S58. doi: 10.1186/1471-2105-11-S1-S58.
About 30% of genes code for membrane proteins, which are involved in a wide variety of crucial biological functions. Despite their importance, experimentally determined membrane protein structures account for only about 1.7% of the protein structures deposited in the Protein Data Bank, because membrane proteins are difficult to crystallize. Algorithms that can identify proteins whose high-resolution structures would aid in predicting the structures of many previously unresolved proteins are therefore of potentially high value. Active machine learning is a supervised machine learning approach well suited to this domain, in which sequences are abundant but very few have known structures. In essence, active learning seeks to identify the proteins whose structures, if revealed experimentally, would be maximally predictive of the others.
An active learning approach is presented for selecting a minimal set of proteins whose structures can aid in determining the transmembrane (TM) helices of the remaining proteins. TMpro, an algorithm we previously developed for high-accuracy TM helix prediction, is coupled with active learning. We show that with a well-designed selection procedure, high accuracy can be achieved with only a few proteins. TMpro trained with a single protein achieved an F-score of 94% on the benchmark evaluation and 91% on the MPtopo dataset. These figures match state-of-the-art accuracies for TM helix prediction, which are usually achieved by training with over 100 proteins.
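For readers unfamiliar with the metric, the F-score quoted above can be written out as follows. This assumes the standard definition as the harmonic mean of precision and recall; the abstract does not state which variant was used, and the symbols P and R are introduced here only for illustration.

```latex
% Assumed standard definition, not taken from the abstract:
%   P = precision: fraction of predicted TM segments that are correct
%   R = recall:    fraction of observed TM segments that are predicted
F = \frac{2\,P\,R}{P + R}
```

Under this definition, an F-score of 94% can only occur when precision and recall are both high, since the harmonic mean is pulled toward the smaller of the two.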
Active learning is suitable for bioinformatics applications in which manually characterized data are not a comprehensive representation of all possible data and may in fact be a very sparse subset of it. It aids in selecting the data instances that, once characterized experimentally, can most improve the accuracy of computational characterization of the remaining raw data. The results presented here also demonstrate that the feature extraction method of TMpro is well designed, achieving a very good separation between TM and non-TM segments.
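The general idea of active learning described above can be illustrated with a minimal pool-based sketch. This is not the selection procedure or the TMpro feature extraction used in the paper, neither of which is specified in the abstract. It swaps in plain uncertainty sampling with a nearest-centroid classifier, and toy 2-D points stand in for feature vectors. All function names (`fit`, `margin`, `active_learn`, and so on) and the `oracle` callback, which plays the role of an experimental characterization, are hypothetical.

```python
import random

def centroid(points):
    """Component-wise mean of a list of equal-length tuples."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def fit(labeled):
    """Nearest-centroid model: one centroid per class label."""
    by_class = {}
    for x, y in labeled:
        by_class.setdefault(y, []).append(x)
    return {y: centroid(xs) for y, xs in by_class.items()}

def predict(model, x):
    return min(model, key=lambda y: sq_dist(x, model[y]))

def margin(model, x):
    """Gap between the two nearest centroids; small means uncertain."""
    d = sorted(sq_dist(x, c) for c in model.values())
    return d[1] - d[0]

def active_learn(pool, oracle, seed_idx, budget):
    """Query the oracle for the least certain pool item, `budget` times.

    Assumes the seed items cover at least two classes.
    """
    seed = set(seed_idx)
    labeled = [(pool[i], oracle(pool[i])) for i in seed_idx]
    unlabeled = [x for i, x in enumerate(pool) if i not in seed]
    for _ in range(budget):
        model = fit(labeled)
        query = min(unlabeled, key=lambda x: margin(model, x))
        unlabeled.remove(query)
        labeled.append((query, oracle(query)))  # the costly "experiment"
    return fit(labeled), labeled

# Toy data: two well-separated clusters standing in for two segment classes.
random.seed(0)
pool = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(50)]
pool += [(random.gauss(6, 1), random.gauss(6, 1)) for _ in range(50)]
truth = {x: (0 if i < 50 else 1) for i, x in enumerate(pool)}

# Start from one labeled item per class, then spend 8 further queries.
model, labeled = active_learn(pool, truth.get, seed_idx=[0, 50], budget=8)
accuracy = sum(predict(model, x) == truth[x] for x in pool) / len(pool)
print(f"labeled {len(labeled)} of {len(pool)} items, accuracy {accuracy:.2f}")
```

Only 10 of the 100 items are ever labeled here. How much accuracy that buys depends on how cleanly the features separate the classes, which is the point the abstract makes about TMpro's feature extraction.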