Natural Language Processing for Surveillance of Cervical and Anal Cancer and Precancer: Algorithm Development and Split-Validation Study.

Authors

Oliveira Carlos R, Niccolai Patrick, Ortiz Anette Michelle, Sheth Sangini S, Shapiro Eugene D, Niccolai Linda M, Brandt Cynthia A

Affiliations

Department of Pediatrics, Yale University School of Medicine, New Haven, CT, United States.

Department of Obstetrics, Gynecology, and Reproductive Sciences, Yale University School of Medicine, New Haven, CT, United States.

Publication information

JMIR Med Inform. 2020 Nov 3;8(11):e20826. doi: 10.2196/20826.

Abstract

BACKGROUND

Accurate identification of new diagnoses of human papillomavirus-associated cancers and precancers is an important step toward the development of strategies that optimize the use of human papillomavirus vaccines. The diagnosis of human papillomavirus cancers hinges on a histopathologic report, which is typically stored in electronic medical records as free-form, or unstructured, narrative text. Previous efforts to perform surveillance for human papillomavirus cancers have relied on the manual review of pathology reports to extract diagnostic information, a process that is both labor- and resource-intensive. Natural language processing can be used to automate the structuring and extraction of clinical data from unstructured narrative text in medical records and may provide a practical and effective method for identifying patients with vaccine-preventable human papillomavirus disease for surveillance and research.

OBJECTIVE

This study's objective was to develop and assess the accuracy of a natural language processing algorithm for the identification of individuals with cancer or precancer of the cervix and anus.

METHODS

A pipeline-based natural language processing algorithm was developed, which incorporated machine learning and rule-based methods to extract diagnostic elements from the narrative pathology reports. To test the algorithm's classification accuracy, we used a split-validation study design. Full-length cervical and anal pathology reports were randomly selected from 4 clinical pathology laboratories. Two study team members, blinded to the classifications produced by the natural language processing algorithm, manually and independently reviewed all reports and classified them at the document level according to 2 domains (diagnosis and human papillomavirus testing results). Using the manual review as the gold standard, the algorithm's performance was evaluated using standard measurements of accuracy, recall, precision, and F-measure.
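The four evaluation measures named above can be computed from document-level labels, as in the minimal sketch below. The function name `classification_metrics`, the boolean label encoding, and the example labels are hypothetical illustrations, not the study's code or data. The F-measure is assumed here to be the balanced F1 score, which the abstract does not specify.

```python
def classification_metrics(gold, predicted):
    """Return accuracy, recall, precision, and F1 for binary labels.

    gold and predicted are equal-length sequences of booleans, where
    True means "abnormal" (a hypothetical encoding for illustration).
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")

    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g and p)          # true positives
    fp = sum(1 for g, p in pairs if not g and p)      # false positives
    fn = sum(1 for g, p in pairs if g and not p)      # false negatives
    tn = sum(1 for g, p in pairs if not g and not p)  # true negatives

    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {
        "accuracy": accuracy,
        "recall": recall,
        "precision": precision,
        "f1": f1,
    }


# Made-up labels: manual review (gold standard) vs. algorithm output.
gold = [True, True, True, False, False, False, False, False]
predicted = [True, True, False, True, False, False, False, False]
metrics = classification_metrics(gold, predicted)
print(metrics)
```

In this made-up example the algorithm misses one abnormal report and flags one normal report, so recall and precision are both 2/3 while accuracy is 6/8.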

RESULTS

The natural language processing algorithm's performance was validated on 949 pathology reports. The algorithm demonstrated accurate identification of abnormal cytology, histology, and positive human papillomavirus tests with accuracies greater than 0.91. Precision was lowest for anal histology reports (0.87, 95% CI 0.59-0.98) and highest for cervical cytology (0.98, 95% CI 0.95-0.99). The natural language processing algorithm missed 2 out of the 15 abnormal anal histology reports, which led to a relatively low recall (0.68, 95% CI 0.43-0.87).
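With small samples, such as the 15 abnormal anal histology reports mentioned above, confidence intervals around a proportion are wide. The sketch below shows one way to compute a 95% interval for a proportion. The abstract does not say which interval method the authors used, so the exact (Clopper-Pearson) binomial interval is an assumption here, and the function names and example counts are hypothetical.

```python
from math import comb


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def _solve_increasing(f, iters=60):
    """Bisection on [0, 1] for the root of a function increasing in p."""
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if f(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def exact_binomial_ci(successes, n, alpha=0.05):
    """Clopper-Pearson interval for a binomial proportion (assumed method)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")

    # Lower bound: the p at which P(X >= successes) equals alpha / 2.
    if successes == 0:
        lower = 0.0
    else:
        lower = _solve_increasing(
            lambda p: (1 - binom_cdf(successes - 1, n, p)) - alpha / 2
        )

    # Upper bound: the p at which P(X <= successes) equals alpha / 2.
    if successes == n:
        upper = 1.0
    else:
        upper = _solve_increasing(
            lambda p: alpha / 2 - binom_cdf(successes, n, p)
        )
    return lower, upper


# Illustrative counts only: 13 correct out of 15.
low, high = exact_binomial_ci(13, 15)
print(f"{13 / 15:.2f} (95% CI {low:.2f}-{high:.2f})")
```

The interval spans roughly 40 percentage points for 15 observations, which illustrates why estimates for the anal histology subgroup are less certain than those for the larger cervical cytology subgroup.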

CONCLUSIONS

This study outlines the development and validation of a freely available and easily implementable natural language processing algorithm that can automate the extraction and classification of clinical data from cervical and anal cytology and histology.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/64ee/7671846/0e7f002005fc/medinform_v8i11e20826_fig1.jpg
