Department of Management Science and Engineering, Tongji University, Shanghai, China.
Department of Industrial, Manufacturing and Systems Engineering, University of Texas at Arlington, Arlington, Texas, USA.
J Comput Chem. 2022 Jul 30;43(20):1342-1354. doi: 10.1002/jcc.26937. Epub 2022 Jun 3.
Machine learning methods have helped advance a wide range of scientific and technological fields in recent years, including computational chemistry. Because chemical systems can be complex and high-dimensional, feature selection is critical but challenging when developing reliable machine-learning-based prediction models, especially for proteins as bio-macromolecules. In this study, we applied the sparse group lasso (SGL) method as a general feature selection method to develop a classification model for an allosteric protein in different functional states. This results in a much improved model with comparable accuracy (Acc) and only 28 selected features, compared with 289 selected features in a previous study. The Acc reaches 91.50% with 1936 selected features, which is far higher than that of the baseline methods. In addition, grouping protein amino acids into secondary structures provides additional interpretability of the selected features. The selected features are verified as associated with key allosteric residues through comparison with both experimental and computational studies of the model protein, and the results demonstrate the effectiveness and necessity of applying rigorous feature selection and evaluation methods to complex chemical systems.
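The abstract does not state the SGL formulation used, so the following is only a minimal sketch of how sparse group lasso can zero out both individual features and whole groups. It assumes the commonly used penalty lam * ((1 - alpha) * sum_g sqrt(p_g) * ||beta_g||_2 + alpha * ||beta||_1) and its standard proximal operator. The function name `sgl_prox`, its parameters, and the toy group labels (standing in for secondary-structure groups) are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def sgl_prox(beta, groups, step, lam, alpha):
    """Proximal operator of the (assumed) sparse group lasso penalty.

    beta   : coefficient vector
    groups : integer group label per coefficient (hypothetical grouping,
             e.g. one label per secondary-structure element)
    step   : gradient step size
    lam    : overall regularization strength
    alpha  : mix between L1 (alpha=1) and group lasso (alpha=0)
    """
    beta = np.asarray(beta, dtype=float)
    groups = np.asarray(groups)

    # Step 1: elementwise soft-thresholding (L1 part) removes weak features.
    out = np.sign(beta) * np.maximum(np.abs(beta) - step * lam * alpha, 0.0)

    # Step 2: groupwise soft-thresholding (group lasso part) removes
    # whole groups whose remaining norm is below the group threshold.
    for g in np.unique(groups):
        idx = groups == g
        norm = np.linalg.norm(out[idx])
        thresh = step * lam * (1.0 - alpha) * np.sqrt(idx.sum())
        if norm <= thresh:
            out[idx] = 0.0
        else:
            out[idx] *= 1.0 - thresh / norm
    return out

# Toy example: two groups of two features each.
beta = np.array([3.0, -0.2, 0.1, 0.05])
groups = np.array([0, 0, 1, 1])
shrunk = sgl_prox(beta, groups, step=1.0, lam=0.5, alpha=0.5)
selected = np.flatnonzero(shrunk)  # indices of features that survive
```

In this toy case only the first feature survives: the weak feature in group 0 is removed by the L1 part, and group 1 is removed as a whole. Inside a proximal gradient loop, this operator would be applied after each gradient step on the classification loss.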