School of Math and Computer Science, University of Havana, La Habana 10200, Cuba.
University Institute for Computing Research (IUII), University of Alicante, Alicante 03690, Spain; Department of Language and Computing Systems, University of Alicante, Alicante 03690, Spain.
J Biomed Inform. 2021 Apr;116:103716. doi: 10.1016/j.jbi.2021.103716. Epub 2021 Feb 26.
Corpora are among the most valuable resources for building machine learning systems. However, building new corpora is expensive, which makes automatically extending existing corpora highly attractive. Finding strategies that reduce the cost and effort of this task while guaranteeing quality remains an open and important challenge for the research community. In this paper, we present a set of ensembling strategies for entity and relation extraction tasks. The main goal is to combine several automatically annotated versions of a corpus into a single version of higher quality. An ensembler is built by exploring a configuration space in search of the combination that maximizes the fitness of the ensembled collection with respect to a reference collection. The eHealth-KD 2019 challenge was chosen as the case study. The outputs of the submitted systems were ensembled, yielding an automatically annotated collection of 8000 sentences. We show that using this collection as additional training input for a baseline algorithm improves its performance. Additionally, the ensembling pipeline was used as a participant system in the 2020 edition of the challenge, where the ensembled run performed slightly better than the individual runs.
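To make the idea of searching a configuration space concrete, the following is a minimal sketch and not the method of the paper. It assumes that each system's output is a set of hashable annotations (here hypothetical `(text, label)` tuples), that annotations are combined by weighted voting with a threshold, that fitness is the F1 score against the reference collection, and that the configuration space is a small exhaustive grid. The names `f1`, `ensemble`, and `search` are illustrative only.

```python
from collections import Counter
from itertools import product


def f1(predicted, gold):
    """F1 score between a predicted and a gold set of annotations."""
    if not predicted and not gold:
        return 1.0
    tp = len(predicted & gold)
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)


def ensemble(submissions, weights, threshold):
    """Keep each annotation whose weighted vote share reaches the threshold.

    Assumption: weighted voting is only one possible combination strategy.
    """
    votes = Counter()
    for name, annotations in submissions.items():
        for ann in annotations:
            votes[ann] += weights[name]
    total = sum(weights.values())
    if total == 0:
        return set()
    return {ann for ann, v in votes.items() if v / total >= threshold}


def search(submissions, reference,
           weight_options=(0, 1), thresholds=(0.25, 0.5, 0.75)):
    """Exhaustively try every (weights, threshold) pair and return the
    (score, weights, threshold) triple with the highest F1."""
    names = sorted(submissions)
    best = (-1.0, None, None)
    for combo in product(weight_options, repeat=len(names)):
        weights = dict(zip(names, combo))
        for t in thresholds:
            score = f1(ensemble(submissions, weights, t), reference)
            if score > best[0]:
                best = (score, weights, t)
    return best


# Hypothetical toy data: three system outputs and a reference collection.
submissions = {
    "a": {("asma", "Concept"), ("tos", "Concept")},
    "b": {("asma", "Concept"), ("x", "Action")},
    "c": {("asma", "Concept"), ("tos", "Concept"), ("y", "Concept")},
}
reference = {("asma", "Concept"), ("tos", "Concept")}

best_score, best_weights, best_threshold = search(submissions, reference)
```

An exhaustive grid is only practical for a handful of systems, since the number of weight combinations grows exponentially with the number of submissions; a larger configuration space would call for a heuristic search instead.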