National Library of Medicine, National Institutes of Health, Bethesda, Maryland, USA.
J Am Med Inform Assoc. 2023 Nov 17;30(12):1887-1894. doi: 10.1093/jamia/ocad152.
We aimed to use heuristic, deep learning (DL), and hybrid AI methods to predict semantic group (SG) assignments for new UMLS Metathesaurus atoms, with a target accuracy of ≥95%.
We used train-test datasets from successive 2020AA-2022AB UMLS Metathesaurus releases. Our heuristic "waterfall" approach employed a sequence of 7 different SG prediction methods. Atoms not qualifying for a method were passed on to the next method. The DL approach generated BioWordVec and SapBERT embeddings for atom names, BioWordVec embeddings for source vocabulary names, and BioWordVec embeddings for atom names of the second-to-top nodes of an atom's source hierarchy. We fed a concatenation of the 4 embeddings into a fully connected multilayer neural network with an output layer of 15 nodes (one for each SG). For both approaches, we developed methods to estimate the probability that their predicted SG for an atom would be correct. Based on these estimations, we developed 2 hybrid SG prediction methods combining the strengths of heuristic and DL methods.
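The waterfall and hybrid logic described above can be illustrated with a minimal Python sketch. This is not the authors' implementation: the two toy rules (`by_source`, `by_suffix`), the stand-in `toy_dl` model, the source name `TOY_DRUG_SOURCE`, and all probability values are hypothetical. The abstract also does not state how the two hybrid methods combine predictions, so the rule used here (keep whichever prediction has the higher estimated probability of being correct) is an assumption.

```python
from typing import Callable, Optional, Sequence, Tuple

# A prediction is (semantic group, estimated probability that it is correct).
Prediction = Tuple[str, float]
# A heuristic method returns a Prediction, or None if the atom does not qualify.
Method = Callable[[dict], Optional[Prediction]]


def waterfall_predict(atom: dict, methods: Sequence[Method]) -> Optional[Prediction]:
    """Try each method in order; the first one the atom qualifies for decides."""
    for method in methods:
        result = method(atom)
        if result is not None:
            return result
    return None  # no method applied to this atom


def hybrid_predict(atom: dict, methods: Sequence[Method],
                   dl_predict: Callable[[dict], Prediction]) -> Prediction:
    """Assumed combination rule: keep the prediction with the higher estimate."""
    heuristic = waterfall_predict(atom, methods)
    dl = dl_predict(atom)
    if heuristic is None:
        return dl
    return heuristic if heuristic[1] >= dl[1] else dl


# --- Hypothetical toy methods, for illustration only ---

def by_source(atom: dict) -> Optional[Prediction]:
    # Toy rule: a source vocabulary whose atoms all fall into one SG.
    table = {"TOY_DRUG_SOURCE": ("CHEM", 0.99)}
    return table.get(atom["source"])


def by_suffix(atom: dict) -> Optional[Prediction]:
    # Toy rule: names ending in "itis" are treated as disorders.
    if atom["name"].lower().endswith("itis"):
        return ("DISO", 0.90)
    return None


def toy_dl(atom: dict) -> Prediction:
    # Stand-in for the neural network; always returns a low-confidence guess.
    return ("PROC", 0.60)


METHODS = [by_source, by_suffix]
```

In this sketch, an atom named "Hepatitis" from an unlisted source falls through `by_source` and is caught by `by_suffix`, while an atom that no rule covers is left to the DL stand-in by `hybrid_predict`.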
The heuristic waterfall approach correctly predicted the SG for 94.3% of 1 563 692 new, unseen atoms. The DL approach also reached an accuracy of 94.3% on the same dataset. The hybrid approaches achieved an average accuracy of 96.5%.
Our study demonstrated that AI methods can predict SG assignments for new UMLS atoms with sufficient accuracy to be potentially useful as an intermediate step in the time-consuming task of assigning new atoms to UMLS concepts. We showed that for SG prediction, combining heuristic methods and DL methods can produce better results than either alone.