Zheng Guangyong, Qian Ziliang, Yang Qing, Wei Chaochun, Xie Lu, Zhu Yangyong, Li Yixue
Department of Computing and Information Technology, Fudan University, 220 Handan Road, Shanghai 200433, PR China.
BMC Bioinformatics. 2008 Jun 16;9:282. doi: 10.1186/1471-2105-9-282.
Transcription factors (TFs) are core functional proteins that play important roles in controlling gene expression, and they are key components in the construction of gene regulatory networks. Traditionally, they have been identified and classified experimentally. To save time and reduce costs, many computational methods have been developed to identify TFs among new proteins and to classify the resulting TFs. Although these methods have facilitated TF screening to some extent, low accuracy remains a common problem. As the number of new proteins grows rapidly, more precise algorithms for identifying TFs among new proteins and classifying them are in high demand.
A support vector machine (SVM) was used to build an automatic detector for TF identification, with protein domains and functional sites as feature vectors. The error-correcting output coding (ECOC) algorithm, which originated in information and communication engineering, was combined with the SVM for TF classification. The overall success rates of identification and classification reached 88.22% and 97.83%, respectively. Finally, a web site was built to give users access to our tools (see the Availability and requirements section for the URL).
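As an illustration of how domain and functional-site annotations might be turned into a feature vector for a classifier such as an SVM, the sketch below uses a simple presence/absence encoding. The abstract does not specify the encoding, so this scheme, the `VOCABULARY` entries, and the `encode_protein` helper are all assumptions made for illustration, not the authors' implementation.

```python
# Hypothetical vocabulary of domain / functional-site identifiers.
# These names are placeholders, not real database accessions.
VOCABULARY = ["domain_A", "domain_B", "site_C", "site_D"]


def encode_protein(annotations, vocabulary=VOCABULARY):
    """Return a 0/1 vector with a 1 wherever the protein carries that feature.

    Annotations that are not in the vocabulary are ignored (an assumption).
    """
    present = set(annotations)
    return [1 if feature in present else 0 for feature in vocabulary]


# A protein annotated with one known domain, one known site,
# and one annotation outside the vocabulary.
vector = encode_protein(["site_C", "domain_A", "unknown_X"])
print(vector)  # [1, 0, 1, 0]
```

Vectors of this kind, one per protein, could then be passed to any SVM implementation together with TF / non-TF labels.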
The SVM method is a valid and stable means of TF identification with protein domains and functional sites as feature vectors. The ECOC algorithm is a powerful method for multi-class classification problems. Combined with the SVM method, it can markedly increase the accuracy of TF classification when protein domains and functional sites are used as feature vectors. In addition, our work suggests that the ECOC algorithm may succeed in a broad range of biological data mining applications.
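The error-correcting property that makes ECOC attractive for multi-class problems can be sketched in a few lines. Each class is assigned a binary codeword, one binary classifier (an SVM in the setting described here) predicts each bit, and the predicted bit string is decoded to the class with the nearest codeword in Hamming distance. The code matrix below is a generic exhaustive code for four classes, and the class labels are placeholders; neither is taken from the paper.

```python
from itertools import combinations

# Hypothetical code matrix: one 7-bit codeword per class,
# one column per binary classifier. Labels are placeholders.
CODEBOOK = {
    "class_A": (1, 1, 1, 1, 1, 1, 1),
    "class_B": (0, 0, 0, 0, 1, 1, 1),
    "class_C": (0, 0, 1, 1, 0, 0, 1),
    "class_D": (0, 1, 0, 1, 0, 1, 0),
}


def hamming(a, b):
    """Number of positions at which two equal-length bit tuples differ."""
    return sum(x != y for x, y in zip(a, b))


def min_distance(codebook=CODEBOOK):
    """Smallest Hamming distance between any two codewords."""
    return min(hamming(a, b) for a, b in combinations(codebook.values(), 2))


def decode(bits, codebook=CODEBOOK):
    """Return the class whose codeword is closest to the predicted bits."""
    return min(codebook, key=lambda label: hamming(codebook[label], bits))


# Suppose the seven binary classifiers output class_C's codeword,
# except that the sixth classifier is wrong.
predicted = (0, 0, 1, 1, 0, 1, 1)
print(min_distance())      # 4
print(decode(predicted))   # class_C
```

With a minimum distance of 4 between codewords, any single wrong bit still decodes to the intended class, which is the sense in which the combined classifier tolerates errors from individual binary learners.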