Oliwa Tomasz, Maron Steven B, Chase Leah M, Lomnicki Samantha, Catenacci Daniel V T, Furner Brian, Volchenboum Samuel L
The University of Chicago, Chicago, IL.
Memorial Sloan Kettering Cancer Center, New York, NY.
JCO Clin Cancer Inform. 2019 Aug;3:1-8. doi: 10.1200/CCI.19.00008.
Robust institutional tumor banks depend on continuous sample curation; without it, subsequent biopsy or resection specimens are overlooked after initial enrollment. Automating curation is hindered by semistructured free-text clinical pathology notes, which complicate data abstraction. We aimed to develop a natural language processing method that dynamically identifies the pathology specimen elements needed to locate existing specimens for future use, in a manner that other institutions can re-implement.
Pathology reports from patients with gastroesophageal cancer enrolled in The University of Chicago GI oncology tumor bank were used to train and validate a novel composite natural language processing-based pipeline with three steps: a supervised machine learning classification step that separates notes into internal (primary review) and external (consultation) reports; a named-entity recognition step that extracts the label (accession number), location, date, and sublabels (block identifiers); and a results proofreading step.
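To make the shape of such a pipeline concrete, the following is a minimal sketch rather than the authors' implementation. It swaps the supervised classifier for a keyword rule and the trained named-entity recognition models for regular expressions. The accession pattern, date format, block pattern, cue words, and all function names are hypothetical; real formats vary by institution. Location extraction is omitted because it would typically need a gazetteer or a trained model.

```python
import re
from dataclasses import dataclass

# Hypothetical patterns: accession, date, and block formats differ by institution.
ACCESSION_RE = re.compile(r"\b[A-Z]{1,3}\d{2}-\d{3,6}\b")
DATE_RE = re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b")
BLOCK_RE = re.compile(r"\bblock\s+([A-Z]\d{1,2})\b", re.IGNORECASE)

# Hypothetical cue words; a stand-in for the supervised classifier in the text.
EXTERNAL_CUES = ("outside", "consult", "referring", "received from")


@dataclass
class Specimen:
    report_type: str  # "internal" or "external"
    labels: list      # accession numbers
    dates: list       # dates as written in the note
    sublabels: list   # block identifiers


def classify_report(text):
    """Label a note as external if any cue word appears, else internal."""
    lowered = text.lower()
    return "external" if any(cue in lowered for cue in EXTERNAL_CUES) else "internal"


def extract_specimen(text):
    """Run the classification step, then pull entities out with the regexes."""
    return Specimen(
        report_type=classify_report(text),
        labels=ACCESSION_RE.findall(text),
        dates=DATE_RE.findall(text),
        sublabels=BLOCK_RE.findall(text),
    )


# Invented example note, not taken from the study data.
note = (
    "Outside slides received from Example Hospital, S18-12345, "
    "collected 03/14/2018, block A2 reviewed."
)
spec = extract_specimen(note)
print(spec)
```

A proofreading step, as described above, could then check each extracted record, for example by flagging notes in which no accession number was found.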
We analyzed 188 pathology reports, including 82 internal reports and 106 external consult reports, and successfully extracted named entities grouped as sample information (label, date, location). Our approach identified up to 24 additional unique samples in external consult notes that could have been overlooked. Our classification model obtained 100% accuracy on the basis of 10-fold cross-validation. Precision, recall, and F1 for class-specific named-entity recognition models show strong performance.
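For readers re-implementing the evaluation, the metrics named above can be computed as in the sketch below. It assumes exact-match scoring of (entity type, value) pairs and a simple strided fold assignment without shuffling or stratification; the abstract does not specify either choice, and the function names and example entities are hypothetical.

```python
def precision_recall_f1(predicted, gold):
    """Exact-match precision, recall, and F1 over two collections of entities."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


def k_fold_indices(n_items, k=10):
    """Assign item indices to k folds by striding (assumption: no shuffling)."""
    return [list(range(start, n_items, k)) for start in range(k)]


# Invented entities for illustration only.
gold = {
    ("label", "S18-12345"),
    ("date", "03/14/2018"),
    ("label", "S17-00042"),
    ("date", "11/02/2017"),
}
predicted = {
    ("label", "S18-12345"),
    ("date", "03/14/2018"),
    ("label", "S17-00042"),
    ("date", "11/20/2017"),  # wrong date, counts as one false positive
}
precision, recall, f1 = precision_recall_f1(predicted, gold)

# 188 matches the number of reports analyzed; each fold is held out once.
folds = k_fold_indices(188, k=10)
```

Each fold serves once as the validation set while the remaining nine are used for training, and the per-fold scores are then averaged.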
Through a combination of natural language processing and machine learning, we devised a re-implementable and automated approach that can accurately extract specimen attributes from semistructured pathology notes to dynamically populate a tumor registry.