Xu Shenbo, Cobzaru Raluca, Finkelstein Stan N, Welsch Roy E, Ng Kenney, Middleton Lefkos
Institute for Data, Systems, and Society, Massachusetts Institute of Technology, Cambridge, MA 02142, USA.
medRxiv. 2024 Sep 16:2024.08.04.24311480. doi: 10.1101/2024.08.04.24311480.
Developing a drug from scratch through regulatory approval and detecting adverse drug reactions (ADRs) have rarely been economical, expeditious, or low-risk investments. The availability of large-scale observational healthcare databases and the popularity of large language models offer an unparalleled opportunity to enable automatic high-throughput drug screening for both repurposing and pharmacovigilance.
To demonstrate a general workflow for automatic high-throughput drug screening with the following advantages: (i) the associations between various exposures and diseases can be estimated; (ii) both repurposing and pharmacovigilance are integrated; (iii) an accurate exposure length for each prescription is parsed from clinical texts; (iv) intrinsic relationships between drugs and diseases are removed jointly by bioinformatic mapping and a large language model (ChatGPT); (v) causal-wise interpretations for incidence rate contrasts are provided.
Using a self-controlled cohort study design, in which subjects serve as their own control group, we tested the intention-to-treat association between medications and the incidence of diseases. The exposure length for each prescription was determined by parsing common dosages written in English free text into a structured format. The exposure period runs from the initial prescription to treatment discontinuation. A period of the same length immediately preceding the initial treatment serves as the control period. Clinical outcomes and categories were identified using existing phenotyping algorithms. Incidence rate ratios (IRR) were tested using uniformly most powerful (UMP) unbiased tests.
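The exposure-length step described above can be illustrated with a minimal sketch. It is not the authors' parser: the regular expressions, the function names (`parse_daily_dose`, `exposure_days`, `study_windows`), and the rule "exposure days = quantity prescribed / units per day" are all assumptions made for illustration, and only a handful of common English dosage phrasings are covered.

```python
import math
import re
from datetime import date, timedelta

# Hypothetical vocabulary; a real parser would need far more phrasings.
_WORDS = {"one": 1, "two": 2, "three": 3, "four": 4, "once": 1, "twice": 2}
_NUM = r"(\d+(?:\.\d+)?|one|two|three|four)"
# "twice a day", "3 times daily", "once per day", ...
_FREQ = re.compile(r"\b(once|twice|" + _NUM + r"\s+times)\s+(?:a\s+day|per\s+day|daily)\b")
# Phrases that imply one administration per day when no count is given.
_DAILY = re.compile(r"\b(daily|every\s+day|a\s+day|per\s+day|at\s+night)\b")


def _to_number(token):
    return float(_WORDS[token]) if token in _WORDS else float(token)


def parse_daily_dose(text):
    """Return units taken per day, or None if the text is not recognised."""
    t = text.lower()
    freq = _FREQ.search(t)
    if freq:
        per_day = _to_number(freq.group(2) or freq.group(1))
        # Drop the frequency phrase so its number is not read as the dose.
        t = t[:freq.start()] + t[freq.end():]
    elif _DAILY.search(t):
        per_day = 1.0
    else:
        return None
    dose = re.search(r"\b" + _NUM + r"\b", t)
    per_dose = _to_number(dose.group(1)) if dose else 1.0  # assume 1 unit if unstated
    return per_dose * per_day


def exposure_days(quantity, dosage_text):
    """Assumed rule: days covered = quantity prescribed / units per day."""
    per_day = parse_daily_dose(dosage_text)
    if per_day is None or per_day <= 0:
        return None
    return math.ceil(quantity / per_day)


def study_windows(first_rx, n_days):
    """Return (control, exposure) windows of equal length around first_rx."""
    span = timedelta(days=n_days)
    return (first_rx - span, first_rx), (first_rx, first_rx + span)


# 56 tablets at 2 tablets twice a day -> 4 per day -> 14 days of exposure.
days = exposure_days(56, "Take 2 tablets twice a day")
control, exposure = study_windows(date(2010, 1, 15), days)
```

Unrecognised text returns `None` rather than a guessed duration, so such prescriptions can be excluded or handled separately instead of silently receiving a wrong exposure length.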
We assessed the associations of 3,444 medications with 276 diseases in 6,613,198 patients from the Clinical Practice Research Datalink (CPRD), a UK primary care electronic health record (EHR) database spanning 1987 to 2018. Because of the built-in selection bias of self-controlled cohort studies, ingredient-disease pairs confounded by deterministic medical relationships were removed using existing mappings from RxNorm and, where no mapping existed, by querying ChatGPT. A total of 16,901 drug-disease pairs showed a significant risk reduction and can be considered candidates for repurposing, while 11,089 pairs showed a significant risk increase, where drug safety may instead be a concern.
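How a single drug-disease pair might be classified as a risk-reduction or risk-increase candidate can be sketched as follows. The sketch relies on a standard result: for two Poisson counts observed over equal person-time, the number of events in the exposure period, conditional on the total, is Binomial(n, 1/2) under the null hypothesis IRR = 1, and the UMP unbiased test is built on this conditional distribution. The code uses the non-randomized exact two-sided binomial test as a stand-in, and it assumes equal person-time in both periods; the function name `irr_test`, the 0.05 threshold, and the labels are illustrative choices, not details from the study.

```python
from math import comb


def irr_test(events_exposed, events_control, alpha=0.05):
    """Exact conditional test of IRR = 1, assuming equal person-time in both periods."""
    n = events_exposed + events_control
    if n == 0:
        return None  # no events in either period: nothing to test
    # Under H0 each outcome i has probability comb(n, i) / 2**n, so comparing
    # integer binomial coefficients ranks outcomes by likelihood exactly.
    ref = comb(n, events_exposed)
    tail = sum(c for c in (comb(n, i) for i in range(n + 1)) if c <= ref)
    p_value = tail / 2 ** n
    irr = events_exposed / events_control if events_control else float("inf")
    if p_value >= alpha:
        label = "not significant"
    elif events_exposed < events_control:
        label = "risk reduction"   # candidate for repurposing
    else:
        label = "risk increase"    # candidate safety signal
    return {"irr": irr, "p_value": p_value, "label": label}


# 3 events during exposure vs 12 during control: IRR = 0.25, p = 0.03515625.
result = irr_test(3, 12)
```

When exposure and control person-time differ, for example because of censoring, the null binomial probability would be the exposed share of total person-time rather than 1/2, and the symmetric shortcut used here would no longer apply.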
This work developed a data-driven, nonparametric, hypothesis-generating, automatic high-throughput workflow that reveals the potential of natural language processing in pharmacoepidemiology. We demonstrate the paradigm on a large observational health dataset to help discover potential novel therapies and adverse drug effects. The framework of this study can be extended to other observational medical databases.