IEEE/ACM Trans Comput Biol Bioinform. 2020 Jan-Feb;17(1):124-135. doi: 10.1109/TCBB.2018.2858806. Epub 2018 Jul 23.
Most previous work on DNA-binding residue prediction does not consider the relationships between residues. In this paper, we propose a novel approach for DNA-binding residue prediction, referred to as EL_LSTM, which has two main components. The first is a Long Short-Term Memory (LSTM) network, which learns pairwise relationships between residues through a bi-gram model and then learns feature vectors for all residues. The second is an ensemble learning based classifier, introduced to tackle the data imbalance problem in binding residue prediction; it uses a variant of the bagging strategy to obtain balanced samples. Evaluations on PDNA-224 and DBP-123 show that classifiers using feature relationships outperform those without them by at least 0.028 on MCC, 1.18 percent on ST, and 0.012 on AUC, which indicates that feature relationships are useful for DNA-binding residue prediction. Evaluation of the ensemble learning component shows improvements of at least 0.021 on MCC, 1.32 percent on ST, and 0.018 on AUC over a single LSTM classifier. Comparisons with state-of-the-art predictors show that the proposed EL_LSTM outperforms them significantly. Further feature analysis validates the effectiveness of LSTM for predicting DNA-binding residues.
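The balanced-sample idea behind the ensemble component can be sketched as follows. This is only an illustration under stated assumptions, not the EL_LSTM implementation: the abstract does not specify the bagging variant, so the sketch assumes each bag keeps every minority-class (binding) residue and undersamples the majority class to the same size. The base learners are represented by arbitrary scoring callables rather than LSTMs, and all function names (`balanced_bags`, `ensemble_predict`, `mcc`) are hypothetical. The `mcc` helper uses the standard Matthews correlation coefficient formula.

```python
import math
import random
from typing import Callable, List, Sequence


def balanced_bags(labels: Sequence[int], n_bags: int, seed: int = 0) -> List[List[int]]:
    """Build index bags with equal numbers of positive and negative samples.

    Assumption: each bag keeps every minority-class index and adds an equally
    sized random sample (without replacement) of majority-class indices.
    """
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    if not minority:
        raise ValueError("both classes must be present")
    return [minority + rng.sample(majority, len(minority)) for _ in range(n_bags)]


def ensemble_predict(
    models: Sequence[Callable[[object], float]], x: object, threshold: float = 0.5
) -> int:
    """Average the per-model binding scores and threshold the mean."""
    mean_score = sum(model(x) for model in models) / len(models)
    return int(mean_score >= threshold)


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


# Toy labels: 1 = binding residue (minority), 0 = non-binding residue.
labels = [1, 0, 0, 0, 0, 1, 0, 0, 0, 0]
bags = balanced_bags(labels, n_bags=3)
```

In this sketch, one base classifier would be trained on each bag and the trained models combined with `ensemble_predict`. Averaging scores is one of several reasonable ways to combine them.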