Wang Liwei, Ruan Xiaoyang, Yang Ping, Liu Hongfang
Department of Health Sciences Research, Mayo Clinic, Rochester, MN, USA.
Cancer Inform. 2016 Dec 8;15:237-242. doi: 10.4137/CIN.S40604. eCollection 2016.
The primary aim was to compare the independent and joint performance of different sources for retrieving smoking status: narrative text processed by natural language processing (NLP), patient-provided information (PPI), and diagnosis codes (ie, International Classification of Diseases, Ninth Revision [ICD-9]). We also compared the performance of narrative text and PPI for retrieving smoking strength information (ie, heavy/light smoker).
Our study leveraged an existing lung cancer cohort whose smoking status, amount, and strength information had been obtained by manual chart review. On the NLP side, smoking-related electronic medical record (EMR) data were retrieved first. A pattern-based smoking information extraction module was then implemented to extract smoking-related information, and heuristic rules were applied to derive smoking status from the extracted information. Smoking information was also obtained from structured data sources, namely diagnosis codes and PPI. Sensitivity, specificity, and accuracy were measured among patients with coverage, where coverage is the proportion of patients whose smoking status/strength can be effectively determined.
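The evaluation described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `evaluate`, the use of `None` to mark a patient whose status could not be determined, and the boolean encoding (True = smoker) are all assumptions made for illustration. Coverage is computed over all patients, while sensitivity, specificity, and accuracy are computed only over covered patients.

```python
from typing import Dict, Optional, Sequence


def evaluate(predicted: Sequence[Optional[bool]], gold: Sequence[bool]) -> Dict[str, float]:
    """Score one source against chart-review labels.

    predicted: True = smoker, False = non-smoker, None = status not determined
               (hypothetical encoding, not taken from the paper).
    gold:      chart-review label for the same patients, in the same order.
    """
    # Keep only patients for whom the source produced a status.
    covered = [(p, g) for p, g in zip(predicted, gold) if p is not None]

    tp = sum(1 for p, g in covered if p and g)
    tn = sum(1 for p, g in covered if not p and not g)
    fp = sum(1 for p, g in covered if p and not g)
    fn = sum(1 for p, g in covered if not p and g)

    def ratio(num: int, den: int) -> float:
        # Avoid division by zero when a class is absent.
        return num / den if den else float("nan")

    return {
        "coverage": ratio(len(covered), len(gold)),
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "accuracy": ratio(tp + tn, len(covered)),
    }


# Toy data for illustration only; these are not study results.
toy_predicted = [True, True, False, None, False]
toy_gold = [True, False, False, True, True]
toy_scores = evaluate(toy_predicted, toy_gold)
```

Because the metrics are restricted to covered patients, a source can score well on accuracy while still leaving many patients undetermined, which is why coverage is reported alongside the other three measures.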
NLP alone had the best overall performance for smoking status extraction (patient coverage: 0.88; sensitivity: 0.97; specificity: 0.70; accuracy: 0.88); combining PPI with NLP further improved patient coverage to 0.96. ICD-9 codes provided no additional improvement over NLP alone or over NLP combined with PPI. For smoking strength, combining NLP with PPI yielded a slight improvement over NLP alone.
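One way a second source can raise coverage is by filling in patients the first source leaves undetermined. The abstract does not state how the sources were combined, so the Python sketch below assumes a simple priority fallback (NLP first, then PPI); the function name `combine_sources` and the `None` encoding for an undetermined status are hypothetical.

```python
from typing import Optional


def combine_sources(*statuses: Optional[bool]) -> Optional[bool]:
    """Return the first determined status, in priority order.

    Each argument is one source's answer for a single patient:
    True = smoker, False = non-smoker, None = not determined.
    The priority ordering is an assumption, not the paper's stated rule.
    """
    for status in statuses:
        if status is not None:
            return status
    return None  # no source could determine the status


# NLP has no answer, so the PPI value is used.
filled_by_ppi = combine_sources(None, True)
# NLP has an answer, so it takes priority over PPI.
nlp_priority = combine_sources(False, True)
# Neither source covers this patient.
still_uncovered = combine_sources(None, None)
```

Under this scheme, adding a source can only increase coverage; its effect on sensitivity and specificity depends on how accurate the fallback source is for the patients it fills in.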
These findings suggest that narrative text could serve as a more reliable and comprehensive source for obtaining smoking-related information than structured data sources. PPI, the readily available structured data, could be used as a complementary source for more comprehensive patient coverage.