Department of Pediatrics, University of Michigan Medical School, Ann Arbor, Michigan, USA.
Department of Internal Medicine, University of Michigan Medical School, Ann Arbor, Michigan, USA.
J Am Med Inform Assoc. 2014 Sep-Oct;21(5):925-37. doi: 10.1136/amiajnl-2014-002767. Epub 2014 Jun 13.
We describe experiments designed to determine the feasibility of distinguishing known from novel associations in a clinical dataset composed of International Classification of Diseases, Ninth Revision (ICD-9) codes from 1.6 million patients. The clinical associations were compared with associations of ICD-9 codes derived from 20.5 million Medline citations processed using MetaMap. Associations appearing in the clinical dataset but not in Medline citations are potentially novel.
Pairwise associations of ICD-9 codes were independently identified in both the clinical and Medline datasets, which were then compared to quantify their degree of overlap. We also performed a manual review of a subset of the associations to validate how well MetaMap performed in identifying diagnoses mentioned in Medline citations that formed the basis of the Medline associations.
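The comparison described above can be illustrated with a minimal sketch. It assumes that an "association" is simply an unordered pair of ICD-9 codes that co-occur in at least one record (one patient, or one citation); the abstract does not state the actual association criterion, so this is a stand-in rather than the study's method. The function names and the toy records are hypothetical.

```python
from itertools import combinations


def pairwise_associations(records):
    """Return the set of unordered code pairs that co-occur in at least one record.

    Assumption: plain co-occurrence stands in for whatever association
    criterion the study actually applied.
    """
    pairs = set()
    for codes in records:
        # Sort and de-duplicate so (a, b) and (b, a) map to the same pair.
        for a, b in combinations(sorted(set(codes)), 2):
            pairs.add((a, b))
    return pairs


def overlap_fraction(clinical_pairs, medline_pairs):
    """Fraction of clinical pairs that also appear among the Medline pairs."""
    if not clinical_pairs:
        return 0.0
    return len(clinical_pairs & medline_pairs) / len(clinical_pairs)


# Toy records, invented for illustration only.
clinical_records = [
    ["250.00", "401.9"],
    ["250.00", "272.4", "401.9"],
    ["493.90", "477.9"],
]
medline_records = [
    ["401.9", "250.00"],
    ["493.90", "530.81"],
]

clinical_pairs = pairwise_associations(clinical_records)
medline_pairs = pairwise_associations(medline_records)

# Pairs seen only in the clinical data are the candidates for novelty.
candidate_novel = clinical_pairs - medline_pairs

print(f"overlap: {overlap_fraction(clinical_pairs, medline_pairs):.1%}")
print(f"candidate novel pairs: {len(candidate_novel)}")
```

Storing each pair in sorted order makes the set intersection insensitive to the order in which codes appear in a record, which matters when the two datasets are built independently.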
The overlap of associations based on ICD-9 codes in the clinical and Medline datasets was low: only 6.6% of the 3.1 million associations found in the clinical dataset were also present in the Medline dataset. Further, a manual review of a subset of the associations that appeared in both datasets revealed that co-occurring diagnoses from Medline citations do not always represent clinically meaningful associations.
Identifying novel associations derived from large clinical datasets remains challenging. Medline as a sole data source for existing knowledge may not be adequate to filter out widely known associations.
In this study, novel associations were not readily identified. Further improvements in accuracy and relevance for tools such as MetaMap are needed to realize their expected utility.