NLP & IR Research Group, E.T.S.I. Informática, Universidad Nacional de Educación a Distancia (UNED), Madrid, Spain.
PLoS One. 2021 Mar 24;16(3):e0248663. doi: 10.1371/journal.pone.0248663. eCollection 2021.
Measuring semantic similarity between sentences is a significant task in the fields of Natural Language Processing (NLP), Information Retrieval (IR), and biomedical text mining. For this reason, the proposal of sentence similarity methods for the biomedical domain has attracted considerable attention in recent years. However, most sentence similarity methods and experimental results reported in the biomedical domain cannot be reproduced, for several reasons: the copying of previous results without confirmation, the lack of source code and data to replicate both methods and experiments, and the lack of a detailed definition of the experimental setup, among others. As a consequence of this reproducibility gap, the state of the art cannot be elucidated, nor can new lines of research be soundly established. In addition, the literature on biomedical sentence similarity exhibits other significant gaps: (1) the evaluation of several sentence similarity methods that remain unexplored and deserve study; (2) the evaluation of an unexplored benchmark on biomedical sentence similarity, called Corpus-Transcriptional-Regulation (CTR); (3) a study of the impact of the pre-processing stage and Named Entity Recognition (NER) tools on the performance of sentence similarity methods; and, finally, (4) the lack of software and data resources for the reproducibility of methods and experiments in this line of research. Having identified these open problems, this registered report introduces a detailed experimental setup, together with a categorization of the literature, to develop the largest and most up-to-date experimental survey on biomedical sentence similarity, and the first that is reproducible. This survey will be based on our own replication of the methods under study and their evaluation on a single software platform, developed specifically for this work, which will become the first publicly available software library for biomedical sentence similarity. Finally, we will provide a detailed reproducibility protocol and dataset as supplementary material to allow the exact replication of all our experiments and results.
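To make the task concrete, the following minimal sketch (not the authors' method or library, and not from the paper) illustrates two of the simplest string-based sentence similarity baselines commonly surveyed in this line of research, token-set Jaccard and bag-of-words cosine, together with a naive pre-processing step of the kind whose impact the survey plans to study. The example sentences and all function names are illustrative assumptions.

```python
# Minimal illustrative sketch of string-based sentence similarity baselines.
# Assumptions: naive pre-processing (lowercasing, splitting on non-word chars);
# no NER, ontologies, or embeddings are involved.
import math
import re
from collections import Counter


def tokenize(sentence: str) -> list[str]:
    # Naive pre-processing: lowercase and split on non-alphanumeric characters.
    return [t for t in re.split(r"\W+", sentence.lower()) if t]


def jaccard_similarity(s1: str, s2: str) -> float:
    # Token-set overlap: |A ∩ B| / |A ∪ B|.
    a, b = set(tokenize(s1)), set(tokenize(s2))
    return len(a & b) / len(a | b) if a | b else 0.0


def cosine_similarity(s1: str, s2: str) -> float:
    # Cosine between bag-of-words term-frequency vectors.
    c1, c2 = Counter(tokenize(s1)), Counter(tokenize(s2))
    dot = sum(c1[t] * c2[t] for t in c1)
    norm = math.sqrt(sum(v * v for v in c1.values())) * \
           math.sqrt(sum(v * v for v in c2.values()))
    return dot / norm if norm else 0.0


if __name__ == "__main__":
    # Hypothetical biomedical sentence pair, for illustration only.
    s1 = "BRCA1 mutations increase the risk of breast cancer."
    s2 = "Mutations in the BRCA1 gene raise breast cancer risk."
    print(f"Jaccard: {jaccard_similarity(s1, s2):.3f}")
    print(f"Cosine:  {cosine_similarity(s1, s2):.3f}")
```

In an evaluation like the one proposed, such scores would be correlated (e.g., via Pearson or Spearman coefficients) with human similarity judgments from benchmarks such as the CTR corpus mentioned above.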