Institute for Health Informatics, University of Minnesota, Minneapolis, Minnesota, USA.
Department of Pharmaceutical Care and Health Systems, University of Minnesota, Minneapolis, Minnesota, USA.
J Am Med Inform Assoc. 2020 Oct 1;27(10):1547-1555. doi: 10.1093/jamia/ocaa128.
We sought to assess the need for additional coverage of dietary supplements (DS) in the Unified Medical Language System (UMLS) by investigating (1) the overlap between the integrated DIetary Supplements Knowledge base (iDISK) DS ingredient terminology and the UMLS and (2) the coverage of iDISK and the UMLS over DS mentions in the biomedical literature.
We estimated the overlap between iDISK and the UMLS by mapping iDISK to the UMLS using exact and normalized strings. The coverage of iDISK and the UMLS over DS mentions in the biomedical literature was evaluated via a DS named-entity recognition (NER) task within PubMed abstracts.
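The two-stage mapping described above can be illustrated with a minimal sketch. It assumes a simplified normalizer (lowercasing, replacing punctuation with spaces, sorting tokens) as a stand-in for whatever normalization the authors actually applied; the function names `normalize` and `map_terms` and the toy concept identifiers are hypothetical and do not come from iDISK or the UMLS.

```python
import re


def normalize(term: str) -> str:
    """Simplified normalization (an assumption, not the authors' method):
    lowercase, replace non-alphanumeric runs with spaces, sort the tokens."""
    tokens = re.sub(r"[^a-z0-9]+", " ", term.lower()).split()
    return " ".join(sorted(tokens))


def map_terms(source_terms, target_index):
    """Map each source term to a target concept.

    target_index maps a target string to a concept identifier.
    An exact string match is tried first, then a normalized match.
    """
    # Index the target strings by their normalized form; the first
    # concept seen for a given normalized string is kept.
    normalized_index = {}
    for string, concept_id in target_index.items():
        normalized_index.setdefault(normalize(string), concept_id)

    result = {}
    for term in source_terms:
        if term in target_index:
            result[term] = ("exact", target_index[term])
        elif normalize(term) in normalized_index:
            result[term] = ("normalized", normalized_index[normalize(term)])
        else:
            result[term] = ("unmatched", None)
    return result


# Toy data with made-up identifiers, for illustration only.
toy_target = {"Vitamin C": "C_TOY_1", "Fish Oil": "C_TOY_2"}
toy_source = ["Vitamin C", "oil, fish", "ascorbate"]
mapping = map_terms(toy_source, toy_target)
```

In this toy run, "Vitamin C" matches exactly, "oil, fish" matches only after normalization, and "ascorbate" stays unmatched. The last case is the kind of term that would need manual review to decide whether it is a new synonym or a lexical variant.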
The overlap analysis revealed that only 30% of iDISK terms could be matched to the UMLS, although the matched terms cover over 99% of iDISK concepts. A manual review revealed that a majority of the unmatched terms represented new synonyms, rather than lexical variants. For NER, iDISK nearly doubled the precision of the UMLS and achieved a higher F1 score, while maintaining competitive recall.
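For readers less familiar with the NER metrics reported here, the sketch below shows how precision, recall, and F1 are conventionally computed from predicted and gold-standard mention spans. It assumes exact span matching, which the abstract does not specify; the function name `prf1` and the toy spans are hypothetical.

```python
def prf1(predicted, gold):
    """Return (precision, recall, F1) for two collections of mention spans.

    Spans are compared by exact equality (an assumption; the source
    does not state the matching criterion used in the evaluation).
    """
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy character-offset spans, for illustration only.
gold_spans = {(0, 9), (20, 28), (40, 47)}
predicted_spans = {(0, 9), (20, 28), (60, 66), (70, 75)}
precision, recall, f1 = prf1(predicted_spans, gold_spans)
```

With these toy spans, two of the four predictions are correct (precision 0.5) and two of the three gold mentions are found (recall 2/3). A terminology can therefore raise its F1 mainly through precision while recall stays roughly comparable, which is the pattern the results describe.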
While iDISK has substantial concept overlap with the UMLS, it contains many novel synonyms. Furthermore, almost 3000 of the overlapping UMLS concepts lack a DS designation, which iDISK could provide. The NER experiments show that the specialization of iDISK is useful for identifying DS mentions.
Our results show that the DS representation in the UMLS could be enriched by adding DS designations to many concepts and by adding new synonyms.