Department of Science, University of Sannio, Benevento, Italy.
BMC Bioinformatics. 2013;14 Suppl 1(Suppl 1):S3. doi: 10.1186/1471-2105-14-S1-S3. Epub 2013 Jan 14.
Supervised machine learning approaches have recently been adopted to infer transcriptional targets from high-throughput transcriptomic and proteomic data, showing major improvements over state-of-the-art methods for reverse engineering gene regulatory networks. Unlike traditional unsupervised techniques, a supervised classifier learns from known examples a function that can recognize new relationships in new data. In gene regulatory inference, a supervised classifier is forced to learn from positive and unlabeled examples only, because negative examples are unavailable or hard to collect. This condition can limit the performance of the classifier, especially when the number of training examples is low.
In this paper we improve the supervised identification of transcriptional targets by selecting reliable negative examples from the unlabeled set. We introduce a heuristic, based on the known topology of transcriptional networks, that in effect restores the conventional positive/negative training condition and significantly improves classification performance. We empirically evaluate the proposed heuristic on experimental datasets of Escherichia coli and show an example application to the prediction of BCL6 direct core targets in normal germinal center human B cells, obtaining a precision of 60%.
The availability of only positive examples when learning transcriptional relationships negatively affects the performance of supervised classifiers. We show that selecting reliable negative examples, a practice adopted in text mining approaches, improves the performance of such classifiers, opening new perspectives for the identification of new transcriptional targets.
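The general idea of selecting reliable negatives from an unlabeled set can be illustrated with a minimal Python sketch. It is not the topology-based heuristic proposed in the paper, whose details are not given in this abstract. Instead it swaps in a simple distance criterion of the kind used in two-step positive/unlabeled learning: unlabeled genes whose expression profiles lie farthest from the centroid of the known targets are treated as negatives. The gene names, expression values, the function `select_reliable_negatives`, and the parameter `k` are all hypothetical.

```python
import math


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def select_reliable_negatives(positives, unlabeled, k):
    """Return the ids of the k unlabeled genes farthest from the positive centroid.

    positives, unlabeled: dicts mapping a gene id to its expression profile.
    Assumption: distance from known targets stands in for "reliably negative";
    this is an illustrative criterion, not the paper's heuristic.
    """
    center = centroid(list(positives.values()))
    ranked = sorted(
        unlabeled,
        key=lambda gene: math.dist(unlabeled[gene], center),
        reverse=True,  # farthest from the known targets first
    )
    return ranked[:k]


# Toy expression profiles (made-up values for illustration only).
positives = {
    "geneA": [2.0, 1.9, 2.1],
    "geneB": [1.8, 2.2, 2.0],
}
unlabeled = {
    "geneC": [2.1, 2.0, 1.9],     # resembles the known targets
    "geneD": [-1.5, -2.0, -1.8],
    "geneE": [0.1, -0.2, 0.0],
    "geneF": [-2.2, -1.7, -2.0],
}

negatives = select_reliable_negatives(positives, unlabeled, k=2)
print(negatives)
```

The selected negatives, together with the known positives, could then be used to train any conventional binary classifier, while the remaining unlabeled genes stay as candidates for prediction.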