Needle in a haystack: Harnessing AI in drug patent searches and prediction.

Author information

Ribeiro Leonardo Costa, Muzaka Valbona

Affiliations

Departamento de Ciências Econômicas, Faculdade de Ciências Econômicas, Universidade Federal de Minas Gerais, Belo Horizonte, Minas Gerais, Brasil.

Economic-History Department, Uppsala University, Uppsala, Sweden.

Publication information

PLoS One. 2024 Dec 2;19(12):e0311238. doi: 10.1371/journal.pone.0311238. eCollection 2024.

DOI:10.1371/journal.pone.0311238

PMID:39621674

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC11611211/

Abstract

The classification codes granted by patent offices are useful instruments for simplifying the bewildering variety of patents in existence. They are singularly unhelpful, however, in locating a specific subgroup of patents such as that of drug-related pharmaceutical patents for which no classification codes exist. Taking advantage of advances in artificial intelligence and in natural language processing in particular, we offer a new method of identifying chemical drug-related patents in this article. The aim is primarily that of demonstrating how the proverbial needle in a haystack was identified, namely through leveraging the superb pattern-recognition abilities of the BERT (Bidirectional Encoder Representations from Transformers) algorithm. We build three different databases to train our algorithm and fine-tune its abilities to identify the patent group in question by exposing it to additional texts containing structures that are much more likely to be present in them, until we obtain the highest possible F1-score, combined with an accuracy of 94.40%. We also demonstrate some possible uses of the algorithm. Its application to the US patent office database enables the identification of potential chemical drug patents up to ten years before drug approval, whereas its application to the German patent office reveals the regional nature of drug R&D and patenting strategies. The hope is that both the method proposed and its applications will be further refined and expanded forthwith.
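The abstract reports model quality as an F1-score together with an accuracy figure. As a reminder of how those two metrics relate for a binary classifier (drug-related patent versus other patent), here is a minimal sketch in plain Python. The function name `binary_metrics` and the toy labels are hypothetical and chosen only for illustration; they are not the authors' code or data.

```python
def binary_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall and F1 for binary labels.

    Label 1 stands for "drug-related patent", label 0 for "other patent"
    (an assumed encoding, used here only for illustration).
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Toy labels, invented for this example only.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 1]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

When the target class is rare, as drug patents are within a full patent database, accuracy alone can look high even if many positives are missed, which is one reason to report F1 alongside it.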
