Division of Bioinformatics and Computational Biology, Department of Medical Informatics and Clinical Epidemiology, Oregon Health and Science University, Portland, Oregon, United States of America.
PLoS One. 2011;6(11):e26160. doi: 10.1371/journal.pone.0026160. Epub 2011 Nov 4.
Computational prediction of Transcription Factor Binding Sites (TFBS) from sequence data alone is difficult and error-prone. Machine learning techniques utilizing additional environmental information about a predicted binding site (such as distances from the site to particular chromatin features) to determine its occupancy/functionality class show promise as methods to achieve more accurate prediction of true TFBS in silico. We evaluate the Bayesian Network (BN) and Support Vector Machine (SVM) machine learning techniques on four distinct TFBS data sets and analyze their performance. We describe the features that are most useful for classification and contrast and compare these feature sets between the factors.
Our results demonstrate good classifier performance both on TFBS for the transcription factors used in initial training and on TFBS for other factors in cross-classification experiments. We find that distances to chromatin modifications (specifically, histone modification islands), as well as distances between such modifications, are effective predictors of TFBS occupancy, though the impact of individual predictors is largely TF-specific. In our experiments, Bayesian network classifiers outperform SVM classifiers.
Our results demonstrate good performance of machine learning techniques on the problem of occupancy classification, and demonstrate that effective classification can be achieved using distances to chromatin features. We additionally demonstrate that cross-classification of TFBS is possible, suggesting the possibility of constructing a generalizable occupancy classifier capable of handling TFBS for many different transcription factors.
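To make the idea of a distance-based feature concrete, the following is a minimal sketch of how the distance from a predicted binding site to the nearest histone modification island might be computed. It is not the authors' implementation: the function names, the mark names, the coordinates, and the interval representation (sorted, non-overlapping, inclusive `(start, end)` tuples on a single chromosome) are all assumptions made for illustration.

```python
from bisect import bisect_right


def distance_to_nearest_island(site, islands):
    """Return the distance in bp from `site` to the nearest island.

    Assumes `islands` is a list of (start, end) tuples on one chromosome,
    sorted by start, non-overlapping, with inclusive coordinates.
    Returns 0 if the site lies inside an island, None if the list is empty.
    """
    if not islands:
        return None
    starts = [start for start, _ in islands]
    # islands[:i] are the islands that start at or before the site.
    i = bisect_right(starts, site)
    candidates = []
    if i > 0:
        _, end = islands[i - 1]
        # Site is inside the island (distance 0) or downstream of its end.
        candidates.append(max(0, site - end))
    if i < len(islands):
        # Next island starts strictly after the site.
        candidates.append(islands[i][0] - site)
    return min(candidates)


def feature_vector(site, marks):
    """Build one distance feature per chromatin mark (hypothetical layout)."""
    return {name: distance_to_nearest_island(site, islands)
            for name, islands in sorted(marks.items())}


# Made-up example data: mark names and coordinates are illustrative only.
example_marks = {
    "H3K4me3": [(100, 200), (500, 650)],
    "H3K27me3": [(1000, 1200)],
}
example_features = feature_vector(320, example_marks)
```

A vector like `example_features`, computed for each candidate site and paired with an occupied/unoccupied label, is the kind of input a Bayesian network or SVM classifier could be trained on. Which features and classifier settings the study actually used are not specified in this abstract.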