Liao Zhijun, Wang Xinrui, Chen Xingyong, Zou Quan
Department of Biochemistry and Molecular Biology, School of Basic Medical Sciences, Fujian Medical University, Fuzhou, China.
State Key Laboratory for Medical Genomics, Shanghai Institute of Hematology, Rui-Jin Hospital affiliated to School of Medicine, Shanghai Jiao Tong University, Shanghai, China.
Comb Chem High Throughput Screen. 2017;20(7):594-602. doi: 10.2174/1386207320666170314094951.
The Krüppel-like factors (KLFs) are a family of zinc finger (ZF) motif-containing transcription factors with 18 members in the human genome; one of them, KLF18, is predicted by bioinformatics. KLFs have various physiological functions and are involved in a number of cancers and other diseases. Here we perform a binary classification of KLFs and non-KLFs using machine learning methods.
The protein sequences of KLFs and non-KLFs were retrieved from UniProt and randomly separated into a training dataset (containing positive and negative sequences) and a test dataset (containing only negative sequences). After extracting 188-dimensional (188D) feature vectors, we carried out classification with four classifiers (GBDT, libSVM, RF, and k-NN). For the human KLFs, we further examined the evolutionary relationships and motif distribution, and finally analyzed the conserved amino acid residues of the three zinc fingers.
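To illustrate the step from a protein sequence to a fixed-length feature vector, here is a minimal Python sketch. It does not reproduce the 188D features used in the study; it computes only a plain 20-dimensional amino acid composition as a simplified stand-in. The function name `composition_features` and the toy sequence are hypothetical and chosen for illustration only.

```python
from collections import Counter

# The 20 standard amino acids, in a fixed order so every vector is comparable.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition_features(sequence):
    """Return the relative frequency of each standard amino acid.

    Simplified stand-in for a sequence feature extractor (assumption):
    the study's 188D features are not reproduced here.
    """
    seq = sequence.upper()
    # Count only standard residues; ambiguous codes such as X are ignored.
    counts = Counter(residue for residue in seq if residue in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [counts[aa] / total for aa in AMINO_ACIDS]


# Hypothetical toy sequence, not a real KLF protein.
toy_sequence = "CCHHKKRR"
features = composition_features(toy_sequence)
```

Each sequence, whatever its length, maps to a vector of the same dimension, which is what lets classifiers such as those named above operate on proteins of different sizes.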
The classifier models built from the training dataset were well constructed: the highest specificity (Sp) was 99.83%, obtained with a library for support vector machines (libSVM), and all correct classification rates were over 70% for 10-fold cross-validation on the test dataset. The 18 human KLFs can be further divided into 7 groups, and the zinc finger domains are located at the carboxyl terminus. Many amino acid residues are conserved, including cysteine and histidine, and the span and interval between them are consistent across the three ZF domains.
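For readers unfamiliar with the reported measures, the following Python sketch shows how specificity (Sp) and the correct classification rate are commonly derived from the counts of a binary confusion matrix. The function name `binary_metrics` and the counts in the example are hypothetical; they are not the study's data or results.

```python
def binary_metrics(tp, fp, tn, fn):
    """Compute standard binary classification measures from confusion counts.

    tp/fn: positives (e.g. KLFs) predicted correctly/incorrectly.
    tn/fp: negatives (e.g. non-KLFs) predicted correctly/incorrectly.
    """
    total = tp + fp + tn + fn
    # Sensitivity: fraction of true positives that were recovered.
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    # Specificity: fraction of true negatives that were recovered.
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    # Accuracy: overall correct classification rate.
    accuracy = (tp + tn) / total if total else 0.0
    return {"Sn": sensitivity, "Sp": specificity, "Acc": accuracy}


# Hypothetical counts for illustration only.
metrics = binary_metrics(tp=45, fp=10, tn=90, fn=5)
```

Note that on a dataset containing only negative sequences, `tp` and `fn` are both zero, so specificity and the correct classification rate coincide and sensitivity is undefined (returned as 0.0 in this sketch).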
Two classification models for KLF prediction have been built using novel machine learning methods.