University of Pittsburgh School of Medicine, Department of Biomedical Informatics, Pittsburgh, PA, United States.
The University of Texas MD Anderson Cancer Center, Department of Biostatistics, Houston, TX, United States.
J Biomed Inform. 2021 May;117:103719. doi: 10.1016/j.jbi.2021.103719. Epub 2021 Mar 11.
Drug safety research asks causal questions but relies on observational data. Confounding bias threatens the reliability of studies using such data. The successful control of confounding requires knowledge of variables called confounders affecting both the exposure and outcome of interest. However, causal knowledge of dynamic biological systems is complex and challenging. Fortunately, computable knowledge mined from the literature may hold clues about confounders. In this paper, we tested the hypothesis that incorporating literature-derived confounders can improve causal inference from observational data.
We introduce two methods (semantic vector-based and string-based confounder search) that query literature-derived information for confounder candidates to control, using SemMedDB, a database of computable knowledge mined from the biomedical literature. Both methods apply a semantically constrained search to SemMedDB for indications that are treated by the drug (exposure) and are also known to cause the adverse event (outcome). We then include the literature-derived confounder candidates in statistical and causal models derived from free-text clinical notes. For evaluation, we use a reference dataset widely used in drug safety that contains labeled pairwise relationships between drugs and adverse events, and we attempt to rediscover these relationships from a corpus of 2.2 million NLP-processed free-text clinical notes. We employ standard adjustment and causal inference procedures to predict and estimate causal effects, informing the models with varying numbers of literature-derived confounders and instantiating the exposure, outcome, and confounder variables in the models with dichotomous EHR-derived data. Finally, we compare the results of these procedures with naive measures of association (χ² and reporting odds ratio) and with each other.
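The search pattern and the naive baselines described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the toy triples, the drug and condition names, and the function names are all hypothetical, and the only assumption taken from SemMedDB is that knowledge is stored as subject-predicate-object predications with predicates such as TREATS and CAUSES. The two association functions use the standard 2x2 contingency table formulas, without continuity correction.

```python
# Toy (subject, predicate, object) triples in the style of SemMedDB predications.
# These rows are invented for illustration and are not taken from SemMedDB.
PREDICATIONS = [
    ("drug_x", "TREATS", "hypertension"),
    ("drug_x", "TREATS", "migraine"),
    ("hypertension", "CAUSES", "kidney_failure"),
    ("diabetes", "CAUSES", "kidney_failure"),
]


def confounder_candidates(drug, adverse_event, predications):
    """Return concepts that the drug TREATS and that also CAUSE the adverse event."""
    treated = {o for s, p, o in predications if s == drug and p == "TREATS"}
    causes = {s for s, p, o in predications if p == "CAUSES" and o == adverse_event}
    return sorted(treated & causes)


def reporting_odds_ratio(a, b, c, d):
    """Reporting odds ratio for a 2x2 table.

    a: exposed, event      b: exposed, no event
    c: unexposed, event    d: unexposed, no event
    """
    return (a * d) / (b * c)


def chi_squared(a, b, c, d):
    """Pearson chi-squared statistic for the same 2x2 table (no continuity correction)."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


if __name__ == "__main__":
    print(confounder_candidates("drug_x", "kidney_failure", PREDICATIONS))
    print(reporting_odds_ratio(20, 80, 10, 90))
    print(chi_squared(20, 80, 10, 90))
```

In this toy example only hypertension satisfies both constraints, so it is the single confounder candidate; migraine is treated by the drug but is not linked to the outcome, and diabetes causes the outcome but is not an indication for the drug. Neither association measure adjusts for any covariate, which is why they serve as the naive comparison.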
We found semantic vector-based search to be superior to string-based search at reducing confounding bias. However, the effect of including more rather than fewer literature-derived confounders was inconclusive. We recommend using targeted learning estimation methods that can address treatment-confounder feedback, where confounders also behave as intermediate variables, and engaging subject-matter experts to adjudicate the handling of problematic covariates.