Patel Tejal A, Puppala Mamta, Ogunti Richard O, Ensor Joe E, He Tiancheng, Shewale Jitesh B, Ankerst Donna P, Kaklamani Virginia G, Rodriguez Angel A, Wong Stephen T C, Chang Jenny C
Houston Methodist Cancer Center, Houston, Texas.
Cancer Research Program, Houston Methodist Research Institute, Houston, Texas.
Cancer. 2017 Jan 1;123(1):114-121. doi: 10.1002/cncr.30245. Epub 2016 Aug 29.
A key challenge to mining electronic health records for mammography research is the preponderance of unstructured narrative text, which strikingly limits usable output. The imaging characteristics of breast cancer subtypes have been described previously, but without standardization of parameters for data mining.
The authors searched the enterprise-wide data warehouse at the Houston Methodist Hospital, the Methodist Environment for Translational Enhancement and Outcomes Research (METEOR), for patients with Breast Imaging Reporting and Data System (BI-RADS) category 5 mammogram readings performed between January 2006 and May 2015 and an available pathology report. The authors developed natural language processing (NLP) software algorithms to automatically extract mammographic and pathologic findings from free text mammogram and pathology reports. The correlation between mammographic imaging features and breast cancer subtype was analyzed using one-way analysis of variance and the Fisher exact test.
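The abstract does not describe the internals of the NLP algorithms, so the following is only a minimal sketch of how keyword-based extraction from a free-text mammogram report might look. The pattern names, the regular expressions, the `extract_features` function, and the sample report are all assumptions made for illustration, not the authors' implementation.

```python
import re

# Hypothetical keyword patterns; the study's actual extraction rules
# are not given in the abstract.
FEATURE_PATTERNS = {
    "spiculated_margin": re.compile(r"\bspiculat\w*", re.IGNORECASE),
    "pleomorphic_calcifications": re.compile(
        r"\bpleomorphic\b[^.]*\bcalcifications?\b", re.IGNORECASE
    ),
    "heterogeneous_calcifications": re.compile(
        r"\bheterogeneous\b[^.]*\bcalcifications?\b", re.IGNORECASE
    ),
}

# Captures the first digit (0-6) that follows a BI-RADS mention.
BIRADS_PATTERN = re.compile(r"\bBI-?RADS\b[^0-9]{0,20}([0-6])", re.IGNORECASE)


def extract_features(report_text):
    """Return a dict of boolean feature flags plus the BI-RADS category."""
    features = {
        name: bool(pattern.search(report_text))
        for name, pattern in FEATURE_PATTERNS.items()
    }
    match = BIRADS_PATTERN.search(report_text)
    features["birads_category"] = int(match.group(1)) if match else None
    return features


# Invented sample text, not taken from any real report.
example_report = (
    "There is a 2 cm mass with spiculated margins in the left breast. "
    "Fine pleomorphic calcifications are noted. "
    "BI-RADS category 5: highly suggestive of malignancy."
)

if __name__ == "__main__":
    print(extract_features(example_report))
```

This sketch ignores negation (for example, "no spiculated margins" would still be flagged), which a production extractor would need to handle.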
The NLP algorithm was able to obtain key characteristics for 543 patients who met the inclusion criteria. Patients with estrogen receptor-positive tumors were more likely to have spiculated margins (P = .0008), and those with tumors that overexpressed human epidermal growth factor receptor 2 (HER2) were more likely to have heterogeneous and pleomorphic calcifications (P = .0078 and P = .0002, respectively).
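For readers unfamiliar with the Fisher exact test named in the methods, the sketch below computes a two-sided P value for a 2x2 table (for example, feature present or absent versus receptor positive or negative). The function name `fisher_exact_two_sided` and the table used in the usage line are hypothetical; the counts are made up for illustration and are not study data.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact P value for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of every table with the same
    margins whose probability does not exceed that of the observed table.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    n = row1 + row2

    def prob(x):
        # Probability that the top-left cell equals x given fixed margins.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    p_observed = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    total = sum(
        prob(x)
        for x in range(lowest, highest + 1)
        # Small tolerance guards against floating-point ties.
        if prob(x) <= p_observed * (1 + 1e-9)
    )
    return min(1.0, total)


if __name__ == "__main__":
    # Made-up counts purely for illustration.
    print(fisher_exact_two_sided(3, 1, 1, 3))
```

In practice a statistics library would normally be used instead of a hand-written routine; the sketch is only meant to show what the test computes.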
Mammographic imaging characteristics, obtained from an automated text search and the extraction of mammogram reports using NLP techniques, correlated with pathologic breast cancer subtype. The results of the current study validate previously reported trends assessed by manual data collection. Furthermore, NLP provides an automated means with which to scale up data extraction and analysis for clinical decision support. Cancer 2017;123:114-121. © 2016 American Cancer Society.