Le Nguyen Quoc Khanh, Do Duyen Thi, Nguyen Trinh-Trung-Duong, Le Quynh Anh
Professional Master Program in Artificial Intelligence in Medicine, College of Medicine, Taipei Medical University, Taipei 106, Taiwan; Research Center for Artificial Intelligence in Medicine, Taipei Medical University, Taipei 106, Taiwan; Translational Imaging Research Center, Taipei Medical University Hospital, Taipei 110, Taiwan.
Graduate Institute of Biomedical Informatics, Taipei Medical University, Taipei 106, Taiwan.
Gene. 2021 Jun 30;787:145643. doi: 10.1016/j.gene.2021.145643. Epub 2021 Apr 18.
Krüppel-like factors (KLFs) are a group of conserved zinc finger-containing transcription factors involved in various physiological and biological processes, including cell proliferation, differentiation, development, and apoptosis. Bioinformatics methods such as sequence similarity searches, multiple sequence alignment, phylogenetic reconstruction, and gene synteny analysis have been used to broaden our knowledge of KLF proteins. In this study, we propose a novel computational approach that applies machine learning to features calculated from primary sequences. Specifically, our XGBoost-based model identifies KLF proteins with an accuracy of 96.4% and a Matthews correlation coefficient (MCC) of 0.704. The model also performs promisingly when tested on an independent dataset. Therefore, it could serve as a useful tool for identifying new KLF proteins and provide useful information for biologists and other researchers studying KLF proteins. Our machine learning source code and datasets are freely available at https://github.com/khanhlee/KLF-XGB.
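As a rough illustration of the two ideas mentioned above (features computed from a primary sequence, and evaluation by accuracy and MCC), the following is a minimal sketch rather than the authors' pipeline. The abstract does not say which sequence features were used, so amino acid composition is an assumption chosen for simplicity. The function names, the toy sequence, and the confusion-matrix counts are hypothetical and are not taken from the study.

```python
import math
from collections import Counter

# The 20 standard amino acids, in a fixed order so feature vectors are comparable.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(sequence):
    """Return the relative frequency of each standard amino acid (20 values).

    Assumption: composition is only one possible sequence-derived feature;
    the study may use different or additional descriptors.
    """
    counts = Counter(ch for ch in sequence.upper() if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [counts[aa] / total for aa in AMINO_ACIDS]


def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)


def matthews_corrcoef(tp, tn, fp, fn):
    """Matthews correlation coefficient from binary confusion-matrix counts.

    Returns 0.0 when the denominator is zero (a common convention).
    """
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return numerator / denominator if denominator else 0.0


if __name__ == "__main__":
    # Hypothetical toy sequence, not a real KLF protein.
    features = amino_acid_composition("MKRHCCPYTGCGKAF")
    print(len(features), round(sum(features), 6))

    # Hypothetical counts for an imbalanced test set (few positives).
    print(accuracy(tp=8, tn=90, fp=1, fn=1))
    print(matthews_corrcoef(tp=8, tn=90, fp=1, fn=1))
```

A feature vector of this kind could be passed to any gradient-boosting classifier. Reporting MCC alongside accuracy is informative because, when positives are rare, MCC can be noticeably lower than accuracy for the same predictions.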