Huang Fang, Shen Jiawei, Guo Qingli, Shi Yongyong
Bio-X Institutes, Key Laboratory for the Genetics of Developmental and Neuropsychiatric Disorders (Ministry of Education) and the Collaborative Innovation Center for Brain Science, Shanghai Jiao Tong University, Shanghai, 200030 People's Republic of China.
Bio-X Institutes, Key Laboratory for the Genetics of Developmental and Neuropsychiatric Disorders (Ministry of Education) and the Collaborative Innovation Center for Brain Science, Shanghai Jiao Tong University, Shanghai, 200030 People's Republic of China ; Shanghai Changning Mental Health Center, Shanghai, 200042 People's Republic of China ; Department of Psychiatry, The First Teaching Hospital of Xinjiang Medical University, Urumqi, 830054 People's Republic of China ; The Bio-X Little White Building, Shanghai Jiao Tong University, No.55 Guang Yuan Xi Road, Shanghai, 200030 China.
Hereditas. 2016 Jun 30;153:6. doi: 10.1186/s41065-016-0012-2. eCollection 2016.
Enhancers are tissue-specific distal regulatory elements that play vital roles in gene regulation and expression. Predicting and identifying enhancers is an important but challenging problem in bioinformatics. Existing computational methods, mostly single classifiers, can only predict enhancers defined by the transcriptional coactivator EP300, and they generalize poorly.
In this study we built a hybrid classifier, eRFSVM, that uses random forests as base classifiers and a support vector machine as the main classifier. eRFSVM has two components, eRFSVM-ENCODE and eRFSVM-FANTOM5, which differ in their features and labels. Each base classifier is a random forest trained on data from a single tissue or cell type. The main classifier, a support vector machine, takes the predictions of the base classifiers as inputs and makes the final decision. For eRFSVM-ENCODE, we trained on data from cell lines including Gm12878, Hep, H1-hesc and Huvec, using ChIP-Seq datasets as features and EP300-based enhancers as labels. Tested on the K562 dataset, eRFSVM-ENCODE reached a precision of 83.69 %, much higher than that of existing classifiers. For eRFSVM-FANTOM5, with enhancers identified from RNA in the FANTOM5 project as labels, eRFSVM achieved a precision, recall, F-score and accuracy of 86.17 %, 36.06 %, 50.84 % and 93.38 %, respectively. Relative to the existing algorithm (69.92 %, 18.30 %, 28.74 % and 89.20 %), these are relative increases of 23.24 %, 97.05 %, 76.90 % and 4.69 %.
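The two-layer design described above can be illustrated with a minimal sketch. It assumes scikit-learn and uses synthetic data in place of the ChIP-Seq features. The helper `make_cell_data`, the hyperparameters, the use of predicted probabilities as SVM inputs, and the separate set used to fit the SVM are all illustrative assumptions, not details taken from the paper.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

rng = np.random.default_rng(0)

def make_cell_data(n=200, n_features=8):
    """Hypothetical stand-in for one cell line: features and 0/1 enhancer labels."""
    X = rng.normal(size=(n, n_features))
    y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=n) > 0).astype(int)
    return X, y

# Cell-line names follow the abstract; the data behind them are synthetic.
cell_lines = ["Gm12878", "Hep", "H1-hesc", "Huvec"]
train_sets = {name: make_cell_data() for name in cell_lines}

# Base layer: one random forest per cell line.
base_models = {
    name: RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
    for name, (X, y) in train_sets.items()
}

def base_outputs(X):
    """Stack each base model's predicted enhancer probability as one column."""
    return np.column_stack([m.predict_proba(X)[:, 1] for m in base_models.values()])

# Main layer: an SVM fitted on the base-layer outputs (assumed to use a separate set).
X_meta, y_meta = make_cell_data()
main_model = SVC(kernel="rbf").fit(base_outputs(X_meta), y_meta)

# Final decision for an unseen set, standing in for a held-out cell line such as K562.
X_test, y_test = make_cell_data()
pred = main_model.predict(base_outputs(X_test))
```

Here `pred` holds one 0/1 decision per test region. Feeding the SVM probabilities rather than hard labels is one possible choice; the abstract only states that the base classifiers' predictions are the inputs.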
Together, these results show that eRFSVM is a better classifier for predicting both EP300-based and FANTOM5 RNA-based enhancers.
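The FANTOM5 figures reported in the abstract can be cross-checked with a few lines of arithmetic. The sketch below assumes the F-score is the harmonic mean of precision and recall and that the reported increases are relative to the existing algorithm's values. The function names are illustrative only.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall (same units as the inputs)."""
    return 2 * precision * recall / (precision + recall)

def relative_gain(new, old):
    """Relative increase of `new` over `old`, in percent."""
    return (new - old) / old * 100

# Values in percent, copied from the abstract.
erfsvm = {"precision": 86.17, "recall": 36.06, "f_score": 50.84, "accuracy": 93.38}
existing = {"precision": 69.92, "recall": 18.30, "f_score": 28.74, "accuracy": 89.20}

# The reported eRFSVM F-score agrees with its precision and recall.
erfsvm_f = round(f_score(erfsvm["precision"], erfsvm["recall"]), 2)

# Relative gains for each metric, rounded to two decimals.
gains = {k: round(relative_gain(erfsvm[k], existing[k]), 2) for k in erfsvm}

print(erfsvm_f)
print(gains)
```

Under these assumptions the computed F-score matches the reported 50.84 %, and the four relative gains match the reported 23.24 %, 97.05 %, 76.90 % and 4.69 %.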