Feng Yuhao, Qi Lei, Tian Weidong
IEEE/ACM Trans Comput Biol Bioinform. 2023 Mar-Apr;20(2):1269-1277. doi: 10.1109/TCBB.2022.3170301. Epub 2023 Apr 3.
Automated recognition of Human Phenotype Ontology (HPO) terms from clinical texts is of significant interest to the field of clinical data mining. In this study, we develop a combined deep learning method named PhenoBERT for this purpose. PhenoBERT uses BERT, a state-of-the-art NLP model, as its core model for evaluating whether a clinically relevant text segment (CTS) can be represented by an HPO term. To avoid comparing a CTS against each of the ∼14,000 HPO terms with BERT, PhenoBERT includes a two-level CNN module, a series of CNN models organized at two levels. For a given CTS, the CNN module produces only a short list of candidate HPO terms for BERT to evaluate, which significantly improves computational efficiency. In addition, BERT can assign an ancestor HPO term to a CTS when recognition of the direct HPO term is not successful, mimicking how humans assign HPO terms. In two benchmark tests, PhenoBERT outperforms four traditional dictionary-based methods and two recently developed deep learning-based methods, and its advantage grows as the recognition task becomes more challenging. As such, PhenoBERT is a useful tool for mining clinical text data.
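As a rough illustration of the shortlist-then-verify flow described in the abstract, here is a minimal Python sketch. It is not the PhenoBERT implementation: the ontology is an invented five-term toy with made-up `TOY:` identifiers, a simple word-overlap score stands in for both the CNN module and BERT, and the thresholds and the ancestor-climbing rule are assumptions chosen only to make the example runnable.

```python
from typing import Callable, Dict, List, Optional, Set, Tuple

# Toy ontology: id -> (label, parent id). IDs and labels are invented
# for illustration; they are NOT real HPO identifiers.
ONTOLOGY: Dict[str, Tuple[str, Optional[str]]] = {
    "TOY:1": ("abnormality of the nervous system", None),
    "TOY:2": ("seizure", "TOY:1"),
    "TOY:3": ("focal motor seizure", "TOY:2"),
    "TOY:4": ("abnormality of the eye", None),
    "TOY:5": ("cataract", "TOY:4"),
}


def tokens(text: str) -> Set[str]:
    return set(text.lower().split())


def overlap(a: str, b: str) -> float:
    """Jaccard overlap of word sets: a cheap stand-in for a learned score."""
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


def root_of(term: str) -> str:
    """Follow parent links up to the top-level term of the branch."""
    while ONTOLOGY[term][1] is not None:
        term = ONTOLOGY[term][1]
    return term


def shortlist(cts: str, k: int = 3) -> List[str]:
    """Two-level filter (stand-in for the CNN module)."""
    # Level 1: pick the top-level branch containing the best-matching term.
    branch_score: Dict[str, float] = {}
    for term, (label, _) in ONTOLOGY.items():
        root = root_of(term)
        branch_score[root] = max(branch_score.get(root, 0.0), overlap(cts, label))
    best_branch = max(branch_score, key=lambda r: branch_score[r])
    # Level 2: keep only the k best-matching terms inside that branch.
    members = [t for t in ONTOLOGY if root_of(t) == best_branch]
    members.sort(key=lambda t: overlap(cts, ONTOLOGY[t][0]), reverse=True)
    return members[:k]


def annotate(
    cts: str,
    score: Callable[[str, str], float] = overlap,  # stand-in for BERT
    accept: float = 0.8,   # assumed threshold for a direct match
    relaxed: float = 0.4,  # assumed threshold for an ancestor match
    k: int = 3,
) -> Optional[str]:
    """Score only the shortlisted terms; fall back to an ancestor if needed."""
    candidates = shortlist(cts, k)
    best = max(candidates, key=lambda t: score(cts, ONTOLOGY[t][0]))
    if score(cts, ONTOLOGY[best][0]) >= accept:
        return best
    # Assumed fallback rule: climb towards the root and accept the first
    # ancestor that clears the relaxed threshold.
    parent = ONTOLOGY[best][1]
    while parent is not None:
        if score(cts, ONTOLOGY[parent][0]) >= relaxed:
            return parent
        parent = ONTOLOGY[parent][1]
    return None


if __name__ == "__main__":
    for segment in ("cataract", "focal seizure", "broken arm"):
        print(segment, "->", annotate(segment))
```

In this toy setup, "cataract" matches its term directly, "focal seizure" misses the direct threshold for the more specific term and is assigned its parent instead, and "broken arm" gets no term. The point of the structure is that the expensive scorer runs on at most `k` candidates per segment rather than on every term in the ontology.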