Provoost Thomas, Moens Marie-Francine
BMC Bioinformatics. 2015;16 Suppl 10(Suppl 10):S4. doi: 10.1186/1471-2105-16-S10-S4. Epub 2015 Jul 13.
The BioNLP Gene Regulation Network (GRN) Task has attracted a diverse collection of submissions showcasing state-of-the-art systems. However, achieving high recall remains a principal challenge, and we argue that recall is an important quality for Information Extraction tasks in this field. We propose a semi-supervised framework that leverages a large corpus of unannotated data. In this framework, the annotated data are used to find plausible candidates for positive data points in the unannotated corpus, and these candidates are included in the machine learning process. Because this method is designed principally to gain recall, we explore two further methods to improve precision on top of it: weighted regularisation in the SVM framework, and filtering of unlabelled examples based on a probabilistic rule-finding method. The latter also allows us to add candidates for negatives from the unlabelled data, which is not viable in the unfiltered approach.
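As an illustration of how candidate positives drawn from unlabelled data might be combined with weighted regularisation, the following minimal sketch trains a linear SVM in which each example carries its own misclassification cost. It assumes that "weighted regularisation" can be read as a per-example cost on the hinge loss, which may differ from the exact formulation used in the paper. The function names, the cost values, the toy feature vectors and the omission of a bias term are all assumptions made for illustration.

```python
import random


def train_cost_weighted_svm(examples, lam=0.01, eta=0.1, epochs=100, seed=0):
    """Fit a linear SVM (no bias term) by subgradient descent.

    examples: list of (features, label, cost) with label in {-1, +1}.
    cost scales that example's hinge loss, so uncertain candidate
    positives can count for less than gold annotations.
    """
    rng = random.Random(seed)
    w = [0.0] * len(examples[0][0])
    order = list(range(len(examples)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            x, y, cost = examples[i]
            margin = y * sum(wj * xj for wj, xj in zip(w, x))
            # L2 regularisation shrinks the weights at every step.
            w = [(1.0 - eta * lam) * wj for wj in w]
            # The hinge subgradient applies only when the margin is below 1,
            # and is scaled by the per-example cost.
            if margin < 1.0:
                w = [wj + eta * cost * y * xj for wj, xj in zip(w, x)]
    return w


def predict(w, x):
    """Return +1 or -1 according to the sign of the linear score."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) > 0 else -1


# Hypothetical costs: gold annotations are trusted fully, candidates less so.
GOLD_COST = 1.0
CANDIDATE_COST = 0.3

# Toy 2-D feature vectors standing in for real relation features.
gold = [((2.0, 2.0), 1, GOLD_COST), ((3.0, 2.0), 1, GOLD_COST),
        ((-2.0, -2.0), -1, GOLD_COST), ((-3.0, -1.0), -1, GOLD_COST)]
# Candidate positives selected from unlabelled data (selection not shown).
candidates = [((1.0, 0.5), 1, CANDIDATE_COST), ((0.5, 1.5), 1, CANDIDATE_COST)]

w = train_cost_weighted_svm(gold + candidates)
print(predict(w, (2.5, 2.5)), predict(w, (-2.5, -2.5)))
```

Lowering the candidate cost limits how far a wrongly selected candidate can move the decision boundary, which is one way such a weighting could trade some of the gained recall back for precision.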
We replicate one of the original participant systems and modify it to incorporate our methods. This allows us to test the effect of our proposed methods by applying them to the GRN task data. We find a considerable improvement in recall compared to the baseline system. We also investigate the evaluation metrics and find several mechanisms that explain a bias towards precision. Furthermore, these findings uncover an intricate precision-recall interaction, which deprives recall of the immediacy it usually has in traditional machine learning set-ups.
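To make the precision-recall trade-off concrete, the sketch below computes standard set-based precision, recall and F1 over extracted relations. It is not the official GRN evaluation metric, which this abstract does not specify. The gene names and relation labels are hypothetical.

```python
def precision_recall_f1(predicted, gold):
    """Set-based precision, recall and F1 over extracted relations."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical gold-standard relations: (agent, target, relation type).
gold_relations = {
    ("geneA", "geneB", "activation"),
    ("geneA", "geneC", "inhibition"),
    ("geneD", "geneB", "activation"),
    ("geneE", "geneF", "inhibition"),
}

# A cautious system: few predictions, all correct.
cautious = {
    ("geneA", "geneB", "activation"),
    ("geneA", "geneC", "inhibition"),
}

# A broader system: finds every gold relation plus one spurious one.
broad = gold_relations | {("geneX", "geneY", "activation")}

print(precision_recall_f1(cautious, gold_relations))
print(precision_recall_f1(broad, gold_relations))
```

In this toy case the cautious system has perfect precision but recovers only half of the gold relations, while the broader system recovers all of them at the price of one false positive.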
Our contributions are twofold.