IEEE/ACM Trans Comput Biol Bioinform. 2023 May-Jun;20(3):1864-1875. doi: 10.1109/TCBB.2022.3219375. Epub 2023 Jun 5.
Retrieval Question Answering (ReQA) is an essential information-sharing mechanism that aims to find the answer to a posed question among a large pool of candidates. The most efficient current solution is the Dual-Encoder, which has shown great potential in the general domain but remains little studied for biomedical ReQA. Obtaining a robust Dual-Encoder from biomedical datasets is challenging because annotated data are too scarce to train the model sufficiently, which leads to over-fitting. In this work, we first build the ReQA BioASQ datasets for retrieving answers to biomedical questions, which can facilitate research on this task. On that basis, we propose a framework that addresses over-fitting for robust biomedical answer retrieval. Under the proposed framework, we first pre-train the Dual-Encoder on a natural language inference (NLI) task before training on biomedical ReQA; we modify the NLI pre-training objective to make it more consistent with biomedical ReQA, which significantly improves transferability. Moreover, to eliminate feature redundancy in the Dual-Encoder, we propose consistent post-whitening, which decorrelates both the sentence embeddings being trained and the trained ones. In extensive experiments, the proposed framework achieves promising results and shows significant improvement over various competitive methods.
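As a rough illustration of the two ideas named in the abstract, decorrelating sentence embeddings and retrieving answers with a Dual-Encoder, the sketch below applies generic PCA whitening to precomputed embeddings and ranks candidate answers by dot product. It is not the paper's consistent post-whitening, whose exact formulation the abstract does not give. The function names (`fit_whitening`, `apply_whitening`, `retrieve`), the use of NumPy, the dot-product scoring, and the choice to apply one shared transform to both question and answer embeddings are all assumptions made for illustration.

```python
import numpy as np


def fit_whitening(embeddings: np.ndarray, eps: float = 1e-12):
    """Estimate a mean and a whitening matrix from embeddings of shape (n, d).

    Generic PCA whitening (an assumption, not the paper's exact method).
    """
    mean = embeddings.mean(axis=0, keepdims=True)
    centered = embeddings - mean
    cov = centered.T @ centered / embeddings.shape[0]
    # The covariance matrix is symmetric, so eigh is the appropriate solver.
    eigvals, eigvecs = np.linalg.eigh(cov)
    # Scale each principal direction (column) to unit variance.
    transform = eigvecs / np.sqrt(eigvals + eps)
    return mean, transform


def apply_whitening(embeddings: np.ndarray, mean: np.ndarray,
                    transform: np.ndarray) -> np.ndarray:
    """Center the embeddings and decorrelate their features."""
    return (embeddings - mean) @ transform


def retrieve(question_vec: np.ndarray, answer_vecs: np.ndarray,
             top_k: int = 3) -> np.ndarray:
    """Return indices of the top_k answers ranked by dot-product score."""
    scores = answer_vecs @ question_vec
    return np.argsort(-scores)[:top_k]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Hypothetical stand-ins for encoder outputs: correlated random vectors.
    answers = rng.normal(size=(500, 4)) @ rng.normal(size=(4, 4))
    question = answers[7] + 0.01 * rng.normal(size=4)

    mean, transform = fit_whitening(answers)
    white_answers = apply_whitening(answers, mean, transform)
    white_question = apply_whitening(question[None, :], mean, transform)[0]
    print(retrieve(white_question, white_answers, top_k=3))
```

In a real system the vectors would come from the question and answer encoders, and the whitening statistics would be estimated on a held-out or training set of embeddings rather than on random data.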