Pan Evan, Roberts Kirk
Department of Computer Science & Engineering, Texas A&M University, College Station, TX, USA.
School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX, USA.
AMIA Jt Summits Transl Sci Proc. 2024 May 31;2024:642-651. eCollection 2024.
The results of clinical trials are a valuable source of evidence for researchers, policy makers, and healthcare professionals. However, online trial registries do not always contain links to the publications that report their results, so finding them requires a time-consuming manual search. Here, we explored the application of pre-trained transformer-based language models to automatically identify result-reporting publications of cancer clinical trials by computing dense vectors and performing semantic search. Models were fine-tuned on text data from trial registry fields and article metadata using a contrastive learning approach. The best-performing model was PubMedBERT, which achieved a mean average precision of 0.592 and ranked 70.3% of a trial's publications in the top 5 results when tested on the held-out test trials. Our results suggest that semantic search using embeddings from transformer models may be an effective approach to the task of linking trials to their publications.
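The retrieval and evaluation steps named in the abstract (ranking candidate articles by embedding similarity, then scoring the ranking with average precision and a top-5 hit fraction) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the toy 2-dimensional vectors, the use of cosine similarity, and the exact definition of the top-k fraction are all assumptions made for illustration. In practice the vectors would come from a fine-tuned transformer encoder such as PubMedBERT.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_articles(trial_vec, article_vecs):
    """Return article ids sorted by descending similarity to the trial vector.

    article_vecs is a hypothetical mapping {article_id: embedding}.
    """
    scored = [(cosine(trial_vec, vec), aid) for aid, vec in article_vecs.items()]
    return [aid for _, aid in sorted(scored, key=lambda pair: -pair[0])]


def average_precision(ranked, relevant):
    """Average precision of one ranked list against a set of relevant ids."""
    hits, total = 0, 0.0
    for rank, aid in enumerate(ranked, start=1):
        if aid in relevant:
            hits += 1
            total += hits / rank  # precision at this rank
    return total / len(relevant) if relevant else 0.0


def fraction_in_top_k(ranked, relevant, k=5):
    """Fraction of a trial's relevant articles that appear in the top k results.

    Assumed definition; the abstract does not spell out the exact formula.
    """
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & set(relevant)) / len(relevant)


def mean_average_precision(rankings, relevants):
    """Mean of per-trial average precision over parallel lists of rankings and relevant sets."""
    scores = [average_precision(r, rel) for r, rel in zip(rankings, relevants)]
    return sum(scores) / len(scores) if scores else 0.0


# Toy example: one trial, four candidate articles, two of them relevant.
trial = [1.0, 0.0]
articles = {"a": [0.9, 0.1], "b": [0.0, 1.0], "c": [0.7, 0.7], "d": [0.6, 0.8]}
relevant = {"a", "d"}

ranking = rank_articles(trial, articles)
print(ranking)
print(average_precision(ranking, relevant))
print(fraction_in_top_k(ranking, relevant, k=2))
```

Mean average precision is the average of the per-trial scores, so a reported value such as 0.592 summarizes ranking quality across all test trials rather than for any single trial.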