Chen Tianhang, Wang Xiangeng, Chu Yanyi, Wang Yanjing, Jiang Mingming, Wei Dong-Qing, Xiong Yi
State Key Laboratory of Microbial Metabolism, School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, Shanghai, China.
Department of Biomedical Sciences, City University of Hong Kong, Hong Kong, China.
Front Microbiol. 2020 Sep 24;11:580382. doi: 10.3389/fmicb.2020.580382. eCollection 2020.
Type IV secreted effectors (T4SEs) can be translocated into the cytosol of host cells via the type IV secretion system (T4SS) and cause disease. However, experimental approaches to identifying T4SEs are time- and resource-consuming, and existing computational tools based on machine learning have clear limitations, such as the lack of interpretability of their prediction models. In this study, we proposed a new model, T4SE-XGB, which uses the eXtreme gradient boosting (XGBoost) algorithm to accurately identify type IV effectors from an optimal set of features derived from protein sequences. After evaluating 20 different types of features, we found that the best performance under 5-fold cross-validation was achieved when all features were fed into XGBoost, compared with other machine learning methods. The ReliefF algorithm was then used to select the optimal feature set on our dataset, which further improved model performance. T4SE-XGB achieved the highest predictive performance on the independent test set and outperformed other published prediction tools. Furthermore, the SHAP method was used to interpret the contribution of features to model predictions. Identifying key features can improve understanding of the multifactorial contributors to host-pathogen interactions and bacterial pathogenesis. Beyond type IV effector prediction, we believe that the proposed framework can guide similar studies in constructing prediction methods for related biological problems. The data and source code of this study are freely available at https://github.com/CT001002/T4SE-XGB.
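The feature-selection step described in the abstract can be illustrated with a small sketch. This is not the authors' implementation (that is in the repository linked above). It assumes amino acid composition as a stand-in for one sequence-derived feature type, since the abstract does not list the 20 types used, and it implements a simplified ReliefF for binary labels. The function names (`aac`, `relieff_scores`, `top_features`) and the toy sequences are invented for illustration. Training XGBoost and computing SHAP values are deliberately left out.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def aac(seq):
    """Amino acid composition: fraction of each standard residue in seq."""
    counts = Counter(seq.upper())
    total = sum(counts[a] for a in AMINO_ACIDS) or 1
    return [counts[a] / total for a in AMINO_ACIDS]


def relieff_scores(X, y, k=1):
    """Simplified ReliefF for binary labels, visiting every sample once.

    A feature gains weight when it differs between a sample and its
    nearest neighbours of the other class (misses), and loses weight
    when it differs from its nearest same-class neighbours (hits).
    """
    n, d = len(X), len(X[0])
    # Per-feature value ranges scale every difference into [0, 1].
    spans = []
    for j in range(d):
        col = [row[j] for row in X]
        spans.append((max(col) - min(col)) or 1.0)

    def diff(a, b, j):
        return abs(a[j] - b[j]) / spans[j]

    def dist(a, b):
        return sum(diff(a, b, j) for j in range(d))

    weights = [0.0] * d
    for i in range(n):
        others = sorted((t for t in range(n) if t != i),
                        key=lambda t: dist(X[i], X[t]))
        hits = [t for t in others if y[t] == y[i]][:k]
        misses = [t for t in others if y[t] != y[i]][:k]
        for j in range(d):
            if hits:
                weights[j] -= sum(diff(X[i], X[t], j) for t in hits) / (len(hits) * n)
            if misses:
                weights[j] += sum(diff(X[i], X[t], j) for t in misses) / (len(misses) * n)
    return weights


def top_features(scores, n_keep):
    """Indices of the n_keep highest-scoring features, best first."""
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)[:n_keep]


# Toy data, invented for illustration only (not real effector sequences).
toy_seqs = ["MEEKEELLEE", "MKEEEELAEE", "MEELEEKEEA",
            "MALKLAGLLA", "MKLAGLAKLL", "MAGLLKALLG"]
toy_labels = [1, 1, 1, 0, 0, 0]

X = [aac(s) for s in toy_seqs]
scores = relieff_scores(X, toy_labels, k=2)
selected = top_features(scores, 3)
print([AMINO_ACIDS[j] for j in selected])
```

In a fuller pipeline, the columns indexed by `selected` would be passed to a classifier such as XGBoost and evaluated with 5-fold cross-validation, as the abstract describes. The number of features to keep is a tuning choice that the abstract does not specify.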