Center for Data Science, New York University, New York, NY, USA.
Department of Computer Science, New York University, New York, NY, USA.
Sci Rep. 2017 Jul 20;7(1):5994. doi: 10.1038/s41598-017-05778-z.
Demand for clinical decision support systems and self-diagnostic symptom checkers has increased substantially in recent years. Existing platforms rely on knowledge bases that are either compiled manually through a labor-intensive process or derived automatically using simple pairwise statistics. This study explored an automated process for learning high-quality knowledge bases linking diseases and symptoms directly from electronic medical records. Medical concepts were extracted from 273,174 de-identified patient records, and maximum likelihood estimation was used to fit three probabilistic models: logistic regression, a naive Bayes classifier, and a Bayesian network using noisy OR gates. A graph of disease-symptom relationships was elicited from the learned parameters, and the constructed knowledge graphs were evaluated and validated against Google's manually constructed knowledge graph (used with permission) and against expert physician opinions. Our study shows that direct, automated construction of high-quality health knowledge graphs from medical records using rudimentary concept extraction is feasible. The noisy OR model produces a high-quality knowledge graph, reaching a precision of 0.85 at a recall of 0.6 in the clinical evaluation. Noisy OR significantly outperforms the other tested models across evaluation frameworks (p < 0.01).
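As an illustration of the noisy OR gate named in the abstract, the following is a minimal sketch of the standard noisy OR conditional probability with a leak term, followed by a simple thresholding step that turns edge parameters into a graph. It is not the authors' implementation: the function names `noisy_or_prob` and `edges_above`, the `leak` parameter, the threshold, and all numeric values are assumptions chosen for illustration and do not come from the study.

```python
from math import prod


def noisy_or_prob(active_edge_probs, leak=0.0):
    """P(symptom present | active diseases) under a noisy OR gate.

    active_edge_probs: for each disease that is present, the probability
    that this disease alone turns the symptom on (hypothetical values).
    leak: probability that the symptom appears with no modeled cause.
    """
    # The symptom is absent only if the leak and every active disease
    # all independently fail to trigger it.
    p_absent = (1.0 - leak) * prod(1.0 - p for p in active_edge_probs)
    return 1.0 - p_absent


def edges_above(edge_params, threshold):
    """Keep (disease, symptom) edges whose parameter meets a threshold.

    edge_params: dict mapping (disease, symptom) -> edge probability.
    Thresholding is one simple, assumed way to elicit a graph from
    learned parameters; the study may use a different criterion.
    """
    return sorted(pair for pair, p in edge_params.items() if p >= threshold)


# Made-up parameters for illustration only (not results from the study).
example_params = {
    ("influenza", "fever"): 0.6,
    ("influenza", "rash"): 0.05,
    ("measles", "rash"): 0.8,
}

p_rash = noisy_or_prob([0.05, 0.8], leak=0.01)
graph = edges_above(example_params, threshold=0.5)
```

Here `p_rash` is the probability of a rash when both illustrative diseases are present, and `graph` keeps only the two edges whose parameters reach the assumed threshold of 0.5.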