Zhang Xiaolong, Lin Xiaoli, Zhao Jiafu, Huang Qianqian, Xu Xin
IEEE/ACM Trans Comput Biol Bioinform. 2019 May-Jun;16(3):774-781. doi: 10.1109/TCBB.2018.2871674.
Hot spot residues play a vital role in bioinformatics tasks aimed at discovering new medications, such as drug design. However, current datasets are composed predominantly of non-hot spots, with only a tiny percentage of hot spots. Conventional hot spot prediction methods may therefore struggle with imbalanced training samples. This paper presents a classification method that combines random forest classification with an oversampling strategy to improve training performance. The oversampling strategy generates additional hot spot data to balance the given training set. Random forest classification is then invoked to build a set of trees on this oversampled training set. The final prediction performance can be computed recursively after the oversampling and training process. The proposed method selects features at random and constructs a robust random forest, which helps avoid overfitting the training set. Experimental results on three data sets indicate that hot spot prediction performance is significantly improved compared with existing classification methods.
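The balancing step described in the abstract can be illustrated with a minimal sketch. The abstract does not say which oversampling strategy the authors use, so the code below substitutes plain random oversampling (duplicating minority-class rows chosen at random) as an assumption. The function name `random_oversample`, its parameters, and the toy feature values are hypothetical and are not taken from the paper.

```python
import random
from collections import Counter


def random_oversample(features, labels, minority_label=1, seed=0):
    """Duplicate randomly chosen minority-class rows until the minority
    class is as large as the largest class.

    Assumption: simple duplication stands in for the paper's oversampling
    strategy, which the abstract does not specify.
    """
    rng = random.Random(seed)
    counts = Counter(labels)
    majority_count = max(counts.values())
    minority_rows = [x for x, y in zip(features, labels) if y == minority_label]
    if not minority_rows:
        raise ValueError("no minority-class samples to oversample")
    # Number of extra minority rows needed to match the largest class.
    deficit = majority_count - counts[minority_label]
    # Copy each sampled row so later edits do not alias the original data.
    extra = [list(rng.choice(minority_rows)) for _ in range(deficit)]
    new_features = list(features) + extra
    new_labels = list(labels) + [minority_label] * deficit
    return new_features, new_labels


# Toy data (hypothetical): 2 hot spots (label 1) versus 6 non-hot spots (label 0).
X = [[0.9, 0.8], [0.8, 0.7], [0.1, 0.2], [0.2, 0.1],
     [0.3, 0.3], [0.2, 0.4], [0.1, 0.1], [0.4, 0.2]]
y = [1, 1, 0, 0, 0, 0, 0, 0]

X_bal, y_bal = random_oversample(X, y)
print(Counter(y_bal))
```

The balanced `X_bal` and `y_bal` could then be passed to any random forest implementation, for example scikit-learn's `RandomForestClassifier`. Oversampling should be applied to the training split only, so that duplicated hot spots do not leak into the evaluation data.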