Jouhet V, Defossez G, Burgun A, le Beux P, Levillain P, Ingrand P, Claveau V
Unité d'épidémiologie, biostatistique et registre des cancers de Poitou-Charentes, Faculté de médecine, Centre Hospitalier Universitaire de Poitiers, Université de Poitiers, Poitiers Cedex, France.
Methods Inf Med. 2012;51(3):242-51. doi: 10.3414/ME11-01-0005. Epub 2011 Jul 26.
Our study aimed to construct and evaluate functions called "classifiers", produced by supervised machine learning techniques, that automatically categorize pathology reports using solely their content.
Patients from the Poitou-Charentes Cancer Registry having at least one pathology report and a single non-metastatic invasive neoplasm were included. A descriptor weighting function accounting for the distribution of terms among targeted classes was developed and compared to classic methods based on inverse document frequencies. The classification was performed with support vector machine (SVM) and Naive Bayes classifiers. Two levels of granularity were tested for both the topographical and the morphological axes of the ICD-O3 code. The ability to correctly attribute a precise ICD-O3 code and the ability to attribute the broad category defined by the International Agency for Research on Cancer (IARC) for the multiple primary cancer registration rules were evaluated using F1-measures.
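The contrast between inverse document frequency and a class-aware descriptor weight can be illustrated with a minimal sketch. The abstract does not give the authors' weighting formula, so `class_concentration_weights` below is a hypothetical stand-in (the share of a term's documents that fall in its dominant class), not the published function. The toy reports, labels, and function names are likewise assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase whitespace tokenizer; a real system would do far more."""
    return text.lower().split()


def idf_weights(docs):
    """Classic inverse document frequency: log(N / df(t)). Ignores class labels."""
    n = len(docs)
    df = Counter()
    for text, _label in docs:
        df.update(set(tokenize(text)))
    return {term: math.log(n / count) for term, count in df.items()}


def class_concentration_weights(docs):
    """Hypothetical class-aware weight (NOT the authors' formula):
    the fraction of a term's documents that belong to its dominant class.
    1.0 means the term occurs in a single class only."""
    df = Counter()
    df_by_class = defaultdict(Counter)
    for text, label in docs:
        terms = set(tokenize(text))
        df.update(terms)
        df_by_class[label].update(terms)
    return {
        term: max(df_by_class[label][term] for label in df_by_class) / df[term]
        for term in df
    }


# Toy labeled reports (invented for illustration; labels mimic topography groups).
docs = [
    ("adenocarcinoma colon biopsy", "C18"),
    ("adenocarcinoma colon resection", "C18"),
    ("carcinoma breast biopsy", "C50"),
    ("carcinoma breast resection", "C50"),
]

idf = idf_weights(docs)
conc = class_concentration_weights(docs)

for term in ("colon", "biopsy"):
    print(term, round(idf[term], 3), conc[term])
```

In this toy corpus "colon" and "biopsy" each appear in two of four reports, so IDF gives them the same weight, whereas the class-aware weight separates the term that is confined to one class from the term that is spread evenly across both.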
A total of 5121 pathology reports produced by 35 pathologists were selected. The best performance was achieved by our class-weighted descriptor combined with an SVM classifier. Using this method, the pathology reports were classified into the IARC categories with F1-measures of 0.967 for both topography and morphology. Attribution of the precise ICD-O3 code was less accurate, with F1-measures of 0.715 for topography and 0.854 for morphology.
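As a reminder of how the F1-measure behaves at two levels of granularity, the sketch below computes per-class F1 as the harmonic mean of precision and recall and then averages over classes. The abstract does not state which averaging scheme was used, so macro-averaging is an assumption here. The `coarse` helper, which truncates a topography code at the dot, is only a stand-in for a broad grouping and does not reproduce the IARC categories. The labels are invented.

```python
def f1_per_class(y_true, y_pred):
    """Return {class: F1}, where F1 = 2 * P * R / (P + R) for each class."""
    scores = {}
    for c in sorted(set(y_true) | set(y_pred)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == c and p == c for t, p in pairs)
        fp = sum(t != c and p == c for t, p in pairs)
        fn = sum(t == c and p != c for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        scores[c] = 2 * precision * recall / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 (assumed averaging scheme)."""
    scores = f1_per_class(y_true, y_pred)
    return sum(scores.values()) / len(scores)


def coarse(code):
    """Illustrative coarsening: keep the part before the dot (e.g. C50.4 -> C50).
    This is NOT the IARC grouping, only a stand-in for a broader category."""
    return code.split(".")[0]


# Invented gold and predicted topography codes.
y_true = ["C50.4", "C50.9", "C18.7", "C18.2"]
y_pred = ["C50.9", "C50.9", "C18.7", "C18.7"]

fine = macro_f1(y_true, y_pred)
broad = macro_f1([coarse(c) for c in y_true], [coarse(c) for c in y_pred])

print(f"fine-grained macro F1: {fine:.3f}")
print(f"broad-category macro F1: {broad:.3f}")
```

Errors that confuse two subsites of the same organ count against the fine-grained score but vanish once codes are collapsed, which matches the direction of the gap reported between precise ICD-O3 codes and broad categories.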
These results suggest that free-text pathology reports could serve as a data source for automated systems that identify and notify new cases of cancer. Future work should evaluate how much natural language processing improves performance, including for reports that describe multiple tumors, and whether other medical documents such as surgical reports can be incorporated.