Butt Luke, Zuccon Guido, Nguyen Anthony, Bergheim Anton, Grayson Narelle
The Australian e-Health Research Centre, Brisbane, Queensland, Australia
Australas Med J. 2013 May 30;6(5):292-9. doi: 10.4066/AMJ.2013.1654. Print 2013.
Cancer monitoring and prevention rely on the timely notification of cancer cases. However, abstracting and classifying cancer from the free text of pathology reports and other relevant documents, such as death certificates, are complex and time-consuming activities.
This paper investigates approaches for automatically detecting notifiable cancer as the cause of death in free-text death certificates supplied to Cancer Registries.
A number of machine learning classifiers were studied. Features were extracted using natural language techniques and the Medtex toolkit. The features included stemmed words, bi-grams, and concepts from the SNOMED CT medical terminology. The baseline was a keyword spotter that used keywords extracted from the long descriptions of ICD-10 cancer-related codes.
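To make the baseline and the token features concrete, the following is a minimal sketch of a keyword spotter together with unigram and bi-gram extraction. It is an illustration under stated assumptions, not the authors' implementation: the keyword list is hypothetical (the paper derives its keywords from ICD-10 long descriptions), the function names are invented for this example, and stemming and SNOMED CT concept extraction (done with Medtex in the paper) are omitted.

```python
import re

# Hypothetical keyword list for illustration only; the paper derives its
# keywords from the long descriptions of ICD-10 cancer-related codes.
CANCER_KEYWORDS = {"carcinoma", "neoplasm", "melanoma", "lymphoma", "leukaemia"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def bigrams(tokens):
    """Return adjacent token pairs, each joined by an underscore."""
    return [f"{a}_{b}" for a, b in zip(tokens, tokens[1:])]


def keyword_spotter(text, keywords=CANCER_KEYWORDS):
    """Flag a certificate as cancer-related if any token is a keyword."""
    return any(tok in keywords for tok in tokenize(text))


# Invented example text, not taken from the study's data.
certificate = "Metastatic carcinoma of the lung"
tokens = tokenize(certificate)
features = tokens + bigrams(tokens)  # unstemmed tokens plus bi-grams
```

A spotter of this kind matches exact tokens only, so it misses spelling variants and unlisted terms, which is one reason a learned classifier over richer features might outperform it.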
Death certificates with notifiable cancer listed as the cause of death can be effectively identified with the methods studied in this paper. A Support Vector Machine (SVM) classifier achieved the best performance, with an overall F-measure of 0.9866 when evaluated on a set of 5,000 free-text death certificates using the token stem feature set. The SNOMED CT concept plus token stem feature set reached the lowest variance (0.0032) and false negative rate (0.0297) while achieving an F-measure of 0.9864. The SVM classifier accounts for the first 18 of the top 40 evaluated runs and is the most robust classifier, with a variance of 0.001141, half the variance of the other classifiers.
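For readers unfamiliar with the reported metrics, the sketch below computes the F-measure and the false negative rate from confusion-matrix counts using their standard definitions. The function names and the counts are made up for illustration and are not the study's data. It also assumes a single positive class, whereas the paper's "overall" F-measure may be aggregated differently.

```python
def f_measure(tp, fp, fn):
    """Harmonic mean of precision and recall (F1) from raw counts."""
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def false_negative_rate(tp, fn):
    """Fraction of actual positives that the classifier missed."""
    if tp + fn == 0:
        return 0.0  # no actual positives to miss
    return fn / (tp + fn)


# Made-up counts for illustration only.
example_f = f_measure(tp=90, fp=10, fn=10)
example_fnr = false_negative_rate(tp=90, fn=10)
```

In a notification setting the false negative rate is of particular interest, since each false negative corresponds to a cancer death that would not be flagged.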
The selection of features had the most significant influence on classifier performance, although the type of classifier employed also affected performance. In contrast, the feature weighting schema had a negligible effect. Specifically, stemmed tokens, with or without SNOMED CT concepts, formed the most effective feature set when combined with an SVM classifier.