Causal feature selection using a knowledge graph combining structured knowledge from the biomedical literature and ontologies: A use case studying depression as a risk factor for Alzheimer's disease.

Author information

Department of Biomedical Informatics, School of Medicine, University of Pittsburgh, Pittsburgh, PA, USA.

Intelligent Systems Program, University of Pittsburgh, Pittsburgh, PA, USA.

Publication information

J Biomed Inform. 2023 Jun;142:104368. doi: 10.1016/j.jbi.2023.104368. Epub 2023 Apr 21.

Abstract

BACKGROUND

Causal feature selection is essential for estimating effects from observational data. Identifying confounders is a crucial step in this process. Traditionally, researchers rely on subject-matter expertise and literature review to identify confounders. Uncontrolled confounding from unidentified confounders threatens validity, conditioning on intermediate variables (mediators) weakens estimates, and conditioning on common effects (colliders) induces bias. Additionally, without special treatment, erroneous conditioning on variables that play multiple roles introduces bias. However, the literature is vast and growing exponentially, making it infeasible for researchers to assimilate this knowledge. To address these challenges, we introduce a novel knowledge graph (KG) application that enables causal feature selection by combining computable literature-derived knowledge with biomedical ontologies. We present a use case of our approach: specifying a causal model for estimating the total causal effect of depression on the risk of developing Alzheimer's disease (AD) from observational data.

METHODS

We extracted computable knowledge from a literature corpus using three machine reading systems and inferred missing knowledge using logical closure operations. Using a KG framework, we mapped the output to target terminologies and combined it with ontology-grounded resources. We translated epidemiological definitions of confounder, collider, and mediator into queries for searching the KG and summarized the roles played by the identified variables. We compared the results with output from a complementary method and with published observational studies, and examined a selection of confounders and combined-role variables in depth.
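As a rough illustration of how role definitions can be turned into graph queries, the sketch below runs over a toy set of directed "causes" edges. It is not the authors' pipeline: the node names (`Z`, `W`, `M`, `C`, `X`), the helper functions, and the choice of transitivity as the closure operation are all assumptions made for this example. A variable is labeled a confounder if it causes both exposure and outcome, a collider if both cause it, and a mediator if the exposure causes it and it causes the outcome.

```python
from itertools import product

def transitive_closure(edges):
    """Add every (a, d) implied by a chain a -> b -> d until nothing new appears."""
    closure = set(edges)
    while True:
        implied = {(a, d) for (a, b), (c, d) in product(closure, closure) if b == c}
        if implied <= closure:
            return closure
        closure |= implied

def classify_roles(edges, exposure, outcome):
    """Map each variable to the set of causal roles it plays for exposure -> outcome."""
    candidates = {node for edge in edges for node in edge} - {exposure, outcome}
    roles = {}
    for v in candidates:
        found = set()
        if (v, exposure) in edges and (v, outcome) in edges:
            found.add("confounder")   # common cause of exposure and outcome
        if (exposure, v) in edges and (outcome, v) in edges:
            found.add("collider")     # common effect of exposure and outcome
        if (exposure, v) in edges and (v, outcome) in edges:
            found.add("mediator")     # lies on a path from exposure to outcome
        if found:
            roles[v] = found
    return roles

# Hypothetical toy edges (cause, effect); none of these are findings from the study.
edges = {
    ("Z", "exposure"), ("Z", "W"), ("W", "outcome"),        # Z reaches outcome only via W
    ("exposure", "M"), ("M", "outcome"),                    # M mediates
    ("exposure", "C"), ("outcome", "C"),                    # C is a common effect
    ("X", "exposure"), ("X", "outcome"), ("exposure", "X"), # X plays combined roles
}

closed = transitive_closure(edges)
roles = classify_roles(closed, "exposure", "outcome")

# Variables that act only as confounders versus those with combined roles.
exclusive_confounders = {v for v, r in roles.items() if r == {"confounder"}}
combined_roles = {v for v, r in roles.items() if len(r) > 1}
```

In this toy graph, `Z` is recognized as a confounder only after the closure step, because its link to the outcome passes through `W`. `X` ends up with two roles, which is the kind of variable that would need special handling before being conditioned on.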

RESULTS

Our search identified 128 confounders (including 58 phenotypes, 47 drugs, and 35 genes), 23 collider phenotypes, and 16 mediator phenotypes. However, only 31 of the 58 confounder phenotypes were found to behave exclusively as confounders, while the remaining 27 phenotypes also played other roles. Obstructive sleep apnea emerged as a potential novel confounder for depression and AD. Anemia exemplified a variable playing combined roles.

CONCLUSION

Our findings suggest that combining machine reading with a KG could augment human expertise for causal feature selection. However, the complexity of causal feature selection for depression and AD highlights the need for standardized field-specific databases of causal variables. Further work is needed to optimize KG search and transform the output for human consumption.
