Sheikhalishahi Seyedmostafa, Miotto Riccardo, Dudley Joel T, Lavelli Alberto, Rinaldi Fabio, Osmani Venet
eHealth Research Group, Fondazione Bruno Kessler Research Institute, Trento, Italy.
Department of Information Engineering and Computer Science, University of Trento, Trento, Italy.
JMIR Med Inform. 2019 Apr 27;7(2):e12239. doi: 10.2196/12239.
Novel approaches that complement and go beyond evidence-based medicine are required in the domain of chronic diseases, given the growing incidence of such conditions in the worldwide population. A promising avenue is the secondary use of electronic health records (EHRs), where patient data are analyzed to conduct clinical and translational research. Machine learning methods for processing EHRs are improving the understanding of patient clinical trajectories and the prediction of chronic disease risk, creating a unique opportunity to derive previously unknown clinical insights. However, a wealth of clinical history remains locked in clinical narratives written as free-form text. Consequently, unlocking the full potential of EHR data is contingent on the development of natural language processing (NLP) methods that automatically transform clinical text into structured clinical data that can guide clinical decisions and potentially delay or prevent disease onset.
The goal of the research was to provide a comprehensive overview of the development and uptake of NLP methods applied to free-text clinical notes related to chronic diseases, including the investigation of challenges faced by NLP methodologies in understanding clinical narratives.
Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines were followed and searches were conducted in 5 databases using "clinical notes," "natural language processing," and "chronic disease" and their variations as keywords to maximize coverage of the articles.
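As an illustration of how keyword variations can be combined to broaden coverage, the following minimal sketch builds a Boolean search string in which variations of one concept are joined with OR and the concepts are joined with AND. The function name `build_query` and the specific variations listed are hypothetical; the abstract does not report the exact search strings or the syntax of each database.

```python
def build_query(concepts):
    """Join the variations of each concept with OR, then join concepts with AND."""
    groups = []
    for variations in concepts:
        quoted = " OR ".join(f'"{term}"' for term in variations)
        groups.append(f"({quoted})")
    return " AND ".join(groups)


# Hypothetical variations for the three keywords named in the abstract;
# the actual variations used by the authors are not given here.
concepts = [
    ["clinical notes", "clinical narratives"],
    ["natural language processing", "NLP"],
    ["chronic disease", "chronic condition"],
]

query = build_query(concepts)
print(query)
```

Requiring one term from every concept group keeps the results on topic, while the OR-joined variations reduce the chance of missing articles that use different wording.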
Of the 2652 articles considered, 106 met the inclusion criteria. Review of the included papers resulted in identification of 43 chronic diseases, which were then further classified into 10 disease categories using the International Classification of Diseases, 10th Revision. The majority of studies focused on diseases of the circulatory system (n=38) while endocrine and metabolic diseases were fewest (n=14). This was due to the structure of clinical records related to metabolic diseases, which typically contain much more structured data, compared with medical records for diseases of the circulatory system, which rely more on unstructured data and have consequently received more attention from NLP research. The review shows a significant increase in the use of machine learning methods compared with rule-based approaches; however, deep learning methods remain emergent (n=3). Consequently, the majority of works focus on classification of disease phenotype, with only a handful of papers addressing extraction of comorbidities from the free text or integration of clinical notes with structured data. Relatively simple methods, such as shallow classifiers (alone or in combination with rule-based methods), remain in notable use because their predictions are interpretable; interpretability still represents a significant issue for more complex methods. Finally, scarcity of publicly available data may also have contributed to insufficient development of more advanced methods, such as extraction of word embeddings from clinical notes.
Efforts are still required to improve (1) progression of clinical NLP methods from extraction toward understanding; (2) recognition of relations among entities rather than entities in isolation; (3) temporal extraction to understand past, current, and future clinical events; (4) exploitation of alternative sources of clinical knowledge; and (5) availability of large-scale, de-identified clinical corpora.