Wang Sheng, Ma Jianzhu, Xu Jinbo
Toyota Technological Institute at Chicago, Chicago, IL, USA; Department of Human Genetics, University of Chicago, Chicago, IL, USA.
Toyota Technological Institute at Chicago, Chicago, IL, USA.
Bioinformatics. 2016 Sep 1;32(17):i672-i679. doi: 10.1093/bioinformatics/btw446.
Intrinsically disordered regions (IDRs) of proteins play an important role in many biological processes. Two key properties of IDRs are that (i) they occur proteome-wide and (ii) the fraction of disordered residues is only about 6%, a class imbalance that makes accurate IDR prediction challenging. Most IDR prediction methods use sequence profiles to improve accuracy, but generating sequence profiles is time-consuming, which prevents the application of these methods to proteome-wide prediction. On the other hand, methods that do not use sequence profiles fare much worse than those that do.
This article formulates IDR prediction as a sequence labeling problem and employs a new machine learning method called Deep Convolutional Neural Fields (DeepCNF) to solve it. DeepCNF integrates deep convolutional neural networks (DCNN) and conditional random fields (CRF); it can model not only the complex sequence-structure relationship in a hierarchical manner, but also the correlation among adjacent residues. To deal with the highly imbalanced order/disorder ratio, instead of training DeepCNF by the widely used maximum-likelihood approach, we develop a novel approach that trains it by maximizing the area under the ROC curve (AUC), which is an unbiased measure for class-imbalanced data.
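To illustrate why AUC is a more informative target than accuracy at a roughly 6% disorder rate, the following minimal Python sketch computes AUC as a pairwise ranking statistic and shows one common way to make it differentiable. It is not the training procedure of DeepCNF, which the abstract does not specify; the sigmoid surrogate is a generic stand-in, and the function names `accuracy`, `auc`, `smooth_auc` and the `temperature` parameter are hypothetical choices made for this example.

```python
import math


def accuracy(scores, labels, threshold=0.5):
    """Fraction of residues whose thresholded score matches the label."""
    hits = sum((s >= threshold) == (y == 1) for s, y in zip(scores, labels))
    return hits / len(labels)


def _split(scores, labels):
    """Separate scores of disordered (1) and ordered (0) residues."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one residue of each class")
    return pos, neg


def auc(scores, labels):
    """Exact AUC: fraction of (disordered, ordered) pairs ranked correctly.

    Ties count as half a correct pair (Wilcoxon-Mann-Whitney statistic).
    """
    pos, neg = _split(scores, labels)
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def smooth_auc(scores, labels, temperature=0.1):
    """Differentiable surrogate (an assumption of this sketch, not the
    paper's method): the step function in auc() is replaced by a sigmoid,
    so the value varies smoothly with the scores."""
    pos, neg = _split(scores, labels)
    total = sum(1.0 / (1.0 + math.exp(-(p - n) / temperature))
                for p in pos for n in neg)
    return total / (len(pos) * len(neg))


# Toy protein: 100 residues, 6 of them disordered (about 6%).
labels = [1] * 6 + [0] * 94
all_ordered = [0.0] * 100            # predictor that never calls disorder
separating = [0.9] * 6 + [0.1] * 94  # predictor that ranks disorder first

print(accuracy(all_ordered, labels), auc(all_ordered, labels))
print(accuracy(separating, labels), auc(separating, labels))
print(smooth_auc(separating, labels))
```

The predictor that labels every residue as ordered reaches high accuracy simply because ordered residues dominate, while its AUC stays at chance level; AUC depends only on how disordered residues are ranked relative to ordered ones, not on how many residues of each class there are.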
Our experimental results show that our IDR prediction method, AUCpreD, outperforms existing popular disorder predictors. More importantly, AUCpreD works very well even without sequence profiles, comparing favorably to or even outperforming many methods that use them. Therefore, our method is suitable for proteome-wide disorder prediction while yielding accuracy similar to or better than that of other methods.
http://raptorx2.uchicago.edu/StructurePropertyPred/predict/
wangsheng@uchicago.edu, jinboxu@gmail.com
Supplementary data are available at Bioinformatics online.