Information Operations and Technology Management, John B. and Lillian E. Neff College of Business and Innovation, The University of Toledo, USA.
Gary W. Rollins College of Business, The University of Tennessee at Chattanooga, USA.
Health Informatics J. 2021 Jan-Mar;27(1):1460458221989392. doi: 10.1177/1460458221989392.
A natural language processing (NLP) application requires sophisticated lexical resources to support its processing goals. The healthcare informatics literature has proposed several solutions, such as dictionary lookup and MetaMap, for identifying disease terms of more than one word (multi-gram disease named entities). Although considerable work has addressed the identification of protein and gene named entities in the biomedical field, little research has addressed the recognition and resolution of terminology in clinical trial subject eligibility analysis. In this study, we develop a specialized lexicon for improving NLP and text mining analysis in the breast cancer domain and evaluate it by comparison with the Systematized Nomenclature of Medicine Clinical Terms (SNOMED CT). We use a hybrid methodology that combines the knowledge of domain experts, terms from multiple online dictionaries, and text mined from sample clinical trials. Our methodology introduces 4243 unique lexicon items, which increase bigram entity matches by 38.6% and trigram entity matches by 41%. By adding this many new terms, the lexicon supports automatically matching patients to clinical trials based on eligibility criteria. Beyond clinical trial matching, the specialized lexicon developed in this study could serve as a foundation for future healthcare text mining applications.
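To make the idea of multi-gram entity matching by dictionary lookup concrete, the following is a minimal sketch, not the authors' implementation. It assumes exact, case-insensitive matching and naive whitespace tokenization; the function names, the toy lexicon, and the sample eligibility sentence are all hypothetical and chosen only for illustration.

```python
from typing import Iterable


def ngrams(tokens: list[str], n: int) -> list[str]:
    """Return every contiguous n-word phrase in tokens, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def match_entities(text: str, lexicon: Iterable[str], sizes=(2, 3)) -> dict[int, list[str]]:
    """Look up each bigram and trigram of text in the lexicon (exact match)."""
    terms = {term.lower() for term in lexicon}
    # Assumption: lowercase the text and strip surrounding punctuation per word.
    tokens = [word.strip(".,;:()") for word in text.lower().split()]
    tokens = [token for token in tokens if token]
    return {n: [gram for gram in ngrams(tokens, n) if gram in terms] for n in sizes}


# Hypothetical lexicon entries and eligibility text, for illustration only.
toy_lexicon = ["breast cancer", "invasive ductal carcinoma", "hormone receptor"]
criterion = "Patients with invasive ductal carcinoma or metastatic breast cancer."
matches = match_entities(criterion, toy_lexicon)
print(matches)
```

In this sketch, a larger lexicon raises the number of bigrams and trigrams that find a match, which is the kind of effect the reported match-rate increases measure; a production system would likely need proper tokenization and handling of term variants.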