Peking University School and Hospital of Stomatology; National Engineering Laboratory for Digital and Material Technology of Stomatology; Peking University; National Clinical Research Center for Oral Diseases, PR China.
Health Informatics J. 2021 Jan-Mar;27(1):1460458220980036. doi: 10.1177/1460458220980036.
Extracting information from unstructured clinical text is a fundamental and challenging task in medical informatics. Our study aims to construct a natural language processing (NLP) workflow that extracts information from Chinese electronic dental records (EDRs) for clinical decision support systems (CDSSs). Based on an existing ontology, we extracted attributes, attribute values, and tooth positions from EDRs. We constructed a workflow that integrates deep learning with keywords, in which the vectors representing texts were learned in an unsupervised manner. Specifically, we implemented Sentence2vec to learn sentence vectors and Word2vec to learn word vectors. For attribute recognition, we calculated similarity values among sentence vectors and extracted attributes according to our selection strategy. For attribute value recognition, we expanded the keyword database by calculating similarity values among word vectors to select keywords. We evaluated the workflow with the hybrid method and compared it with a keyword-based method and a deep learning method. In both attribute and attribute value recognition, the hybrid method outperformed the other two methods, achieving high precision (0.94, 0.94), recall (0.74, 0.82), and F1 score (0.83, 0.88). Our NLP workflow can efficiently structure narrative text from EDRs, providing accurate input and a solid foundation for further data-based CDSSs.
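The keyword-expansion step and the reported scores can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the 0.8 similarity threshold, and the three-dimensional toy vectors are all assumptions made for illustration, and hand-written vectors stand in for trained Word2vec embeddings. The sketch assumes cosine similarity as the similarity measure, which the abstract does not specify, and it checks that the reported precision and recall pairs are consistent with the reported scores under the standard F1 formula.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def expand_keywords(seeds, vectors, threshold=0.8):
    """Return the seed keywords plus every vocabulary word whose vector is
    at least `threshold`-similar to some seed (threshold is an assumed value)."""
    expanded = set(seeds)
    for word, vec in vectors.items():
        if word in expanded:
            continue
        if any(cosine(vec, vectors[s]) >= threshold for s in seeds if s in vectors):
            expanded.add(word)
    return expanded


def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Toy vectors invented for illustration; real ones would come from Word2vec
# trained on the EDR corpus.
TOY_VECTORS = {
    "caries": [0.9, 0.1, 0.0],
    "decay": [0.85, 0.2, 0.05],
    "mobility": [0.0, 0.1, 0.95],
}

print(sorted(expand_keywords({"caries"}, TOY_VECTORS)))  # ['caries', 'decay']
print(round(f1(0.94, 0.74), 2), round(f1(0.94, 0.82), 2))  # 0.83 0.88
```

Expansion here compares candidates only against the original seeds, not against newly added words, so a single pass cannot drift far from the curated keyword list.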