Zhang Xiang, Hu Jiaxin, Lu Qian, Niu Lu, Wang Xinqi
IEEE J Biomed Health Inform. 2025 Apr;29(4):2922-2931. doi: 10.1109/JBHI.2024.3480998. Epub 2025 Apr 4.
Automatic extraction of valuable, structured evidence from the exponentially growing clinical trial literature can help physicians practice evidence-based medicine quickly and accurately. However, current research on evidence extraction is limited by poor generalization across clinical topics and by the high cost of manual annotation. In this work, we address these challenges by constructing PICO-DS, a PICO-based evidence dataset covering five clinical topics. The dataset was labeled automatically by distant supervision, using our proposed textual similarity algorithm, ROUGE-Hybrid. We then present Aceso-DSAL, an extension of Aceso, our previous supervised evidence extraction model. Aceso-DSAL uses the distantly labeled, multi-topic PICO-DS as its training corpus, which greatly improves the generalization of the extraction model. To mitigate the noise that distant supervision unavoidably introduces, we employ TextCNN and MW-Net models together with an active learning paradigm to weigh the value of each sample. We evaluate our model on the PICO-DS dataset and find that it outperforms state-of-the-art methods in identifying evidential sentences.
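The abstract describes labeling sentences by distant supervision with a textual similarity score. The sketch below illustrates that general idea only. It is not the authors' ROUGE-Hybrid algorithm, whose definition is not given here. As a stand-in it averages plain ROUGE-1 and ROUGE-L F-scores, and it assumes that each article comes with a reference evidence statement to compare against. The function names, the equal weighting, and the 0.5 threshold are all hypothetical choices made for illustration.

```python
from collections import Counter
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def f1(overlap, cand_len, ref_len):
    """F1 score from an overlap count and the two sequence lengths."""
    if overlap == 0 or cand_len == 0 or ref_len == 0:
        return 0.0
    precision = overlap / cand_len
    recall = overlap / ref_len
    return 2 * precision * recall / (precision + recall)


def rouge_1(cand, ref):
    """ROUGE-1 F1: clipped unigram overlap between two token lists."""
    overlap = sum((Counter(cand) & Counter(ref)).values())
    return f1(overlap, len(cand), len(ref))


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(cand, ref):
    """ROUGE-L F1 based on the longest common subsequence."""
    return f1(lcs_length(cand, ref), len(cand), len(ref))


def similarity(sentence, reference):
    """Stand-in score: mean of ROUGE-1 and ROUGE-L F1 (an assumption, not ROUGE-Hybrid)."""
    c, r = tokenize(sentence), tokenize(reference)
    return 0.5 * (rouge_1(c, r) + rouge_l(c, r))


def distant_label(sentences, reference, threshold=0.5):
    """Label a sentence 1 (evidential) if it is similar enough to the reference, else 0.

    The threshold is a hypothetical value; labels produced this way are noisy.
    """
    return [(s, int(similarity(s, reference) >= threshold)) for s in sentences]
```

Labels produced this way are noisy by construction, which is the problem the abstract says the sample-weighting and active learning components are meant to mitigate.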