Newman-Griffis Denis, Fosler-Lussier Eric
Department of Biomedical Informatics, University of Pittsburgh, Pittsburgh, Pennsylvania, USA.
Epidemiology & Biostatistics Section, Rehabilitation Medicine Department, National Institutes of Health Clinical Center, Bethesda, Maryland, USA.
Front Digit Health. 2021 Mar;3. doi: 10.3389/fdgth.2021.620828. Epub 2021 Mar 10.
Linking clinical narratives to standardized vocabularies and coding systems is a key component of unlocking the information in medical text for analysis. However, many domains of medical concepts, such as functional outcomes and social determinants of health, lack well-developed terminologies that can support effective coding of medical text. We present a framework for developing natural language processing (NLP) technologies for automated coding of medical information in under-studied domains, and demonstrate its applicability through a case study on physical mobility function. Mobility function is a component of many health measures, from post-acute care and surgical outcomes to chronic frailty and disability, and is represented as one domain of human activity in the International Classification of Functioning, Disability, and Health (ICF). However, mobility and other types of functional activity remain under-studied in the medical informatics literature, and neither the ICF nor commonly used medical terminologies capture functional status terminology in practice. We investigated two data-driven paradigms, classification and candidate selection, to link narrative observations of mobility status to standardized ICF codes, using a dataset of clinical narratives from physical therapy encounters. Features derived from recent advances in language modeling and word embedding were used with established machine learning models and a novel deep learning approach, achieving a macro-averaged F1 score of 84% on linking mobility activity reports to ICF codes. Both classification and candidate selection approaches present distinct strengths for automated coding in under-studied domains, and we highlight that the combination of (i) a small annotated dataset; (ii) expert definitions of codes of interest; and (iii) a representative text corpus is sufficient to produce high-performing automated coding systems. This research has implications for continued development of language technologies to analyze functional status information, and the ongoing growth of NLP tools for a variety of specialized applications in clinical care and research.
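As a rough illustration of the candidate selection paradigm and the macro-averaged F1 metric mentioned above, the following is a minimal sketch and not the authors' implementation. It assumes that each mobility report and each ICF code definition has already been embedded as a vector; the two-dimensional vectors, the gold labels, and the function names `select_candidate` and `macro_f1` are hypothetical and chosen only for illustration. The ICF codes d410 and d450 are used here purely as example labels.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def select_candidate(mention_vec, code_vecs):
    """Candidate selection: return the code whose definition vector is
    most similar to the mention vector."""
    return max(code_vecs, key=lambda code: cosine(mention_vec, code_vecs[code]))


def macro_f1(gold, pred):
    """Unweighted mean of per-label F1 over all labels seen in gold or pred."""
    labels = sorted(set(gold) | set(pred))
    scores = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy, made-up embeddings of code definitions (assumption: not real data).
code_vecs = {"d410": [1.0, 0.0], "d450": [0.0, 1.0]}
# Toy, made-up embeddings of three mobility reports and their gold codes.
mentions = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]]
gold = ["d410", "d450", "d450"]

preds = [select_candidate(m, code_vecs) for m in mentions]
score = macro_f1(gold, preds)
print(preds, round(score, 3))
```

In this framing, a classification approach would instead train a model to predict one of a fixed set of codes directly from the report representation, whereas candidate selection scores each report against a representation of every code, so the expert definitions of the codes can be used as an input.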