Koopman Bevan, Zuccon Guido, Nguyen Anthony, Bergheim Anton, Grayson Narelle
The Australian e-Health Research Centre, CSIRO, Brisbane, Australia.
Queensland University of Technology, Brisbane, Australia.
Int J Med Inform. 2015 Nov;84(11):956-65. doi: 10.1016/j.ijmedinf.2015.08.004. Epub 2015 Aug 13.
Death certificates provide an invaluable source for cancer mortality statistics; however, this value can only be realised if accurate, quantitative data can be extracted from them, an aim hampered by both the volume and the variable nature of certificates written in natural language. This paper proposes an automatic classification system for identifying cancer-related causes of death from death certificates.
Detailed features, including terms, n-grams and SNOMED CT concepts, were extracted from a collection of 447,336 death certificates. These features were used to train Support Vector Machine classifiers (one classifier for each cancer type). The classifiers were deployed in a cascaded architecture: the first level identified the presence of cancer (i.e., binary cancer/nocancer) and the second level identified the type of cancer (according to the ICD-10 classification system). A held-out test set was used to evaluate the effectiveness of the classifiers according to precision, recall and F-measure. In addition, detailed feature analysis was performed to reveal the characteristics of a successful cancer classification model.
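The two-level cascade described above can be illustrated with a minimal sketch. It is not the authors' implementation: the trained Support Vector Machines are replaced here by hypothetical keyword-matching stand-ins (`keyword_classifier`), the features are limited to word unigrams and bigrams (no SNOMED CT concepts), and the function names, keyword lists and example texts are assumptions made for illustration only. C34 and C50 are the ICD-10 categories for malignant neoplasms of the bronchus and lung and of the breast.

```python
from typing import Callable, Dict, List

Features = List[str]
Classifier = Callable[[Features], bool]


def word_ngrams(text: str, max_n: int = 2) -> Features:
    """Return lowercase word n-grams of length 1..max_n (whitespace tokenised)."""
    tokens = text.lower().split()
    return [
        " ".join(tokens[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(tokens) - n + 1)
    ]


def keyword_classifier(keywords: List[str]) -> Classifier:
    """Hypothetical stand-in for a trained binary SVM.

    Fires when any of the given keywords appears among the features.
    """
    wanted = set(keywords)
    return lambda features: any(f in wanted for f in features)


def classify(
    text: str,
    cancer_detector: Classifier,
    type_classifiers: Dict[str, Classifier],
) -> List[str]:
    """Cascade: level 1 decides cancer/nocancer, level 2 assigns cancer types."""
    features = word_ngrams(text)
    if not cancer_detector(features):
        return []  # nocancer: the second level is never consulted
    # One binary classifier per cancer type, keyed by ICD-10 category.
    return [code for code, clf in type_classifiers.items() if clf(features)]


# Illustrative keyword lists only; a real system would learn these from data.
cancer_detector = keyword_classifier(["carcinoma", "cancer", "malignant neoplasm"])
type_classifiers = {
    "C34": keyword_classifier(["lung", "bronchus"]),
    "C50": keyword_classifier(["breast"]),
}

print(classify("Metastatic carcinoma of lung", cancer_detector, type_classifiers))
print(classify("Acute myocardial infarction", cancer_detector, type_classifiers))
```

Because each level only needs a callable that maps features to a yes/no decision, the keyword stand-ins could be swapped for trained models without changing the cascade logic.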
The system was highly effective at identifying cancer as the underlying cause of death (F-measure 0.94). The system was also effective at determining the type of cancer for common cancers (F-measure 0.7). Rare cancers, for which there was little training data, were difficult to classify accurately (F-measure 0.12). Factors influencing performance were the amount of training data and certain ambiguous cancers (e.g., those in the stomach region). The feature analysis revealed that a combination of features was important for cancer type classification, with SNOMED CT concept and oncology-specific morphology features proving the most valuable.
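For readers unfamiliar with the metrics reported above, the following sketch shows how precision, recall and F-measure are computed for a single binary classifier. It assumes the balanced F-measure (F1, the harmonic mean of precision and recall); the abstract does not state which weighting was used. The labels below are toy values invented for the example, not data from the study.

```python
from typing import Sequence, Tuple


def precision_recall_f1(
    gold: Sequence[bool], predicted: Sequence[bool]
) -> Tuple[float, float, float]:
    """Compute precision, recall and balanced F-measure for binary labels."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g and p)        # true positives
    fp = sum(1 for g, p in pairs if not g and p)    # false positives
    fn = sum(1 for g, p in pairs if g and not p)    # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy labels: True means "cancer is the underlying cause of death".
gold = [True, True, True, False, False]
predicted = [True, True, False, True, False]
precision, recall, f1 = precision_recall_f1(gold, predicted)
print(f"precision={precision:.2f} recall={recall:.2f} F={f1:.2f}")
```

With a classifier per cancer type, the same computation would be repeated for each type, which is why rare types with few positive examples can score far lower than common ones.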
The system proposed in this study provides automatic identification and characterisation of cancers from large collections of free-text death certificates. This allows organisations such as Cancer Registries to monitor and report on cancer mortality in a timely and accurate manner. In addition, the methods and findings are generally applicable beyond cancer classification and to other sources of medical text besides death certificates.