School of Computational and Integrative Sciences, Jawaharlal Nehru University, New Delhi, India.
Proteins. 2020 Jan;88(1):15-30. doi: 10.1002/prot.25763. Epub 2019 Jul 8.
Sequence-based DNA-binding protein (DBP) prediction is a widely studied biological problem. Sliding windows over the rows of position-specific substitution matrices (PSSMs) predict DNA-binding residues well in known DBPs, but the same models cannot be applied to whole protein sequences of unequal length. PSSM summaries, namely column averages and their amino-acid-wise versions, have been used effectively for this task, but it remains unclear whether these features carry all of the PSSM's predictive power, which has traditionally been harnessed for binding-site prediction. Here we evaluate whether PSSMs scaled up to a fixed size by zero-vector padding (pPSSM) can perform better than summary-based features on similar models. Using a multilayer perceptron (MLP) and a deep convolutional neural network (CNN), we found that (a) summary features work well for single-genome (human-only) data but are outperformed by pPSSM on diverse PDB-derived data sets, suggesting greater summary-level redundancy in the former; (b) even when summary features perform comparably to pPSSM, a consensus of the two outperforms both; (c) CNN models comprehensively outperform their corresponding MLP models; and (d) the actual predicted scores from different models depend on the choice of input feature set, whereas overall performance levels depend on the model, with CNNs achieving the highest accuracy.
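The two feature families contrasted in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the `max_len` parameter, the truncation of over-long sequences, and the 20 x 20 interpretation of the "amino-acid-wise" summary are all assumptions made for illustration.

```python
import numpy as np

# Standard 20 amino acids; the ordering here is an assumption.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def pad_pssm(pssm, max_len):
    """Zero-pad an (L x 20) PSSM to a fixed (max_len x 20) array (pPSSM).

    Sequences longer than max_len are truncated; that choice is an
    assumption of this sketch, not something stated in the abstract.
    """
    pssm = np.asarray(pssm, dtype=float)
    padded = np.zeros((max_len, pssm.shape[1]))
    n = min(pssm.shape[0], max_len)
    padded[:n] = pssm[:n]
    return padded


def column_average(pssm):
    """Summary feature: mean of each PSSM column over all positions."""
    return np.asarray(pssm, dtype=float).mean(axis=0)


def residue_wise_average(pssm, sequence):
    """Amino-acid-wise summary (assumed form): for each residue type,
    average the PSSM rows at positions holding that residue, giving a
    flattened 20 x 20 = 400-dimensional vector. Absent residue types
    contribute a zero row."""
    pssm = np.asarray(pssm, dtype=float)
    seq = np.array(list(sequence))
    out = np.zeros((len(AMINO_ACIDS), pssm.shape[1]))
    for i, aa in enumerate(AMINO_ACIDS):
        mask = seq == aa
        if mask.any():
            out[i] = pssm[mask].mean(axis=0)
    return out.ravel()
```

Either summary yields a fixed-length vector regardless of protein length, whereas the padded matrix keeps per-position information at the cost of a much larger input, which is the trade-off the study evaluates.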