Enabling full-length evolutionary profiles based deep convolutional neural network for predicting DNA-binding proteins from sequence.

Affiliation

School of Computational and Integrative Sciences, Jawaharlal Nehru University, New Delhi, India.

Publication information

Proteins. 2020 Jan;88(1):15-30. doi: 10.1002/prot.25763. Epub 2019 Jul 8.

Abstract

Sequence-based prediction of DNA-binding proteins (DBPs) is a widely studied biological problem. Sliding windows over the rows of position-specific substitution matrices (PSSMs) predict DNA-binding residues well in known DBPs, but the same models cannot be applied to protein sequences of unequal size. PSSM summaries, namely column averages and their amino-acid-wise versions, have been used effectively for the task, but it remains unclear whether these features carry all of the PSSM's predictive power, which has traditionally been harnessed for binding-site prediction. Here we evaluate whether PSSMs scaled up to a fixed size by zero-vector padding (pPSSM) perform better than the summary-based features on similar models. Using a multilayer perceptron (MLP) and a deep convolutional neural network (CNN), we found that (a) summary features work well for single-genome (human-only) data but are outperformed by pPSSM on diverse PDB-derived data sets, suggesting greater summary-level redundancy in the former; (b) even when summary features perform comparably to pPSSM, a consensus of the two outperforms both; (c) CNN models comprehensively outperform their corresponding MLP models; and (d) the actual predicted scores from different models depend on the input feature set used, whereas overall performance levels depend on the model, with the CNN achieving the highest accuracy.
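
The two input representations compared in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `pad_pssm` and `column_average`, the `max_len` and `n_cols` parameters, and the choice to reject sequences longer than `max_len` are all assumptions made here for illustration. A PSSM is taken to be a list of rows, one per residue, each holding one score per substitution column.

```python
def pad_pssm(pssm, max_len, n_cols=20):
    """Zero-pad an L x n_cols PSSM to max_len x n_cols (a pPSSM-style input).

    Assumption: sequences longer than max_len are rejected rather than
    truncated, because the abstract only describes padding.
    """
    if len(pssm) > max_len:
        raise ValueError("sequence is longer than max_len")
    rows = [list(row) for row in pssm]
    # Append all-zero rows until every protein has the same number of rows.
    rows += [[0.0] * n_cols for _ in range(max_len - len(rows))]
    return rows


def column_average(pssm):
    """Summary feature: mean of each PSSM column over all residues."""
    n_rows = len(pssm)
    return [sum(col) / n_rows for col in zip(*pssm)]


# Toy 2-residue "PSSM" with only 2 columns, to keep the example readable.
toy = [[1.0, 2.0], [3.0, 4.0]]
padded = pad_pssm(toy, max_len=4, n_cols=2)
summary = column_average(toy)
```

The padded matrix keeps every per-residue row and so has a fixed shape suitable for a CNN or MLP, whereas the column average collapses the protein to a single vector whose length does not depend on sequence length.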
