From the Departments of Radiology (J.Z., J.T., J.S., A.S.) and Neurosurgery (M.P., M.B., A.C., J.B., E.K.O.), Icahn School of Medicine at Mount Sinai, 1 Gustave Levy Pl, New York, NY 10029; and Department of Bioengineering and Bioinformatics, Boston University, Boston, Mass (J.L.).
Radiology. 2018 May;287(2):570-580. doi: 10.1148/radiol.2018171093. Epub 2018 Jan 30.
Purpose To compare different methods for generating features from radiology reports and to develop a method to automatically identify findings in these reports.

Materials and Methods In this study, 96 303 head computed tomography (CT) reports were obtained. The linguistic complexity of these reports was compared with that of alternative corpora. Head CT reports were preprocessed, and machine-analyzable features were constructed by using bag-of-words (BOW), word embedding, and latent Dirichlet allocation-based approaches. Ultimately, 1004 head CT reports were manually labeled for findings of interest by physicians, and a subset of these findings was deemed critical. Lasso logistic regression was used to train models for physician-assigned labels on 602 of the 1004 head CT reports (60%) by using the constructed features, and the performance of these models was validated on the held-out 402 of 1004 reports (40%). Models were scored by area under the receiver operating characteristic curve (AUC), and aggregate AUC statistics were reported for (a) all labels, (b) critical labels, and (c) the presence of any critical finding in a report. Sensitivity, specificity, accuracy, and F1 score were reported for the best-performing model's (a) predictions of all labels and (b) identification of reports containing critical findings.

Results The best-performing model (BOW with unigrams, bigrams, and trigrams plus average word-embedding vectors) had a held-out AUC of 0.966 for identifying the presence of any critical head CT finding and an average AUC of 0.957 across all head CT findings. Sensitivity and specificity for identifying the presence of any critical finding were 92.59% (175 of 189) and 89.67% (191 of 213), respectively. Average sensitivity and specificity across all findings were 90.25% (1898 of 2103) and 91.72% (18 351 of 20 007), respectively. Simpler BOW methods achieved results competitive with those of more sophisticated approaches, with an average AUC for the presence of any critical finding of 0.951 for unigram BOW versus 0.966 for the best-performing model. The Yule I of the head CT corpus was 34, markedly lower than that of the Reuters corpus (103) or that of I2B2 discharge summaries (271), indicating lower linguistic complexity.

Conclusion Automated methods can be used to identify findings in radiology reports. The success of this approach benefits from the standardized language of these reports. With this method, a large labeled corpus can be generated for applications such as deep learning.

© RSNA, 2018. Online supplemental material is available for this article.
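As a concrete illustration of the feature construction in the Materials and Methods, the sketch below combines n-gram BOW counts with an averaged word-embedding vector per report, the combination used by the best-performing model. This is a minimal sketch under stated assumptions, not the authors' code: it assumes scikit-learn and SciPy, and build_features, reports, and embeddings (a token-to-vector mapping such as pretrained word2vec vectors) are hypothetical names, as is the embedding dimension.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import CountVectorizer

def build_features(reports, embeddings, dim=300):
    """Concatenate n-gram BOW counts with an averaged word-embedding vector.

    reports    : list of preprocessed report strings
    embeddings : dict-like mapping token -> np.ndarray of shape (dim,)
    """
    # BOW with unigrams, bigrams, and trigrams, as in the best model.
    vectorizer = CountVectorizer(ngram_range=(1, 3))
    bow = vectorizer.fit_transform(reports)

    # Average the embedding vectors of the tokens in each report.
    avg = np.zeros((len(reports), dim))
    for i, text in enumerate(reports):
        vecs = [embeddings[t] for t in text.split() if t in embeddings]
        if vecs:
            avg[i] = np.mean(vecs, axis=0)

    # Keep the BOW block sparse and append the dense embedding block.
    return hstack([bow, csr_matrix(avg)]).tocsr(), vectorizer
```

Keeping the n-gram counts sparse matters at this scale: a corpus of roughly 96 000 reports with trigram features would be impractical as a dense matrix.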
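The modeling step, one lasso (L1-penalized) logistic regression per finding label scored by held-out AUC, could look like the following scikit-learn sketch. The 60/40 split mirrors the abstract; the regularization strength C and the per-label loop are assumptions, since the abstract does not report how they were set, and auc_per_label is a hypothetical name.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def auc_per_label(X, Y, label_names):
    """Fit one lasso logistic regression per finding and report held-out AUC.

    X : (n_reports, n_features) feature matrix, e.g. from build_features()
    Y : (n_reports, n_labels) binary physician-assigned labels
    """
    aucs = {}
    for j, name in enumerate(label_names):
        # Roughly 60% train / 40% held-out validation, as in the study.
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, Y[:, j], test_size=0.4, random_state=0)
        model = LogisticRegression(penalty="l1", solver="liblinear", C=1.0)
        model.fit(X_tr, y_tr)
        aucs[name] = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
    return aucs
```

The L1 penalty drives most coefficients to zero, which suits the very wide, sparse n-gram feature matrix and keeps the per-label models interpretable.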
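The Yule I comparison in the Results can be computed with a short function. One common formulation, the inverse of Yule's characteristic K, is I = M1^2 / (M2 - M1), where M1 is the total token count and M2 is the sum of squared token frequencies; lower values indicate a more repetitive, stereotyped vocabulary. The sketch below assumes this formulation, which the abstract does not spell out.

```python
from collections import Counter

def yule_i(tokens):
    """Yule's I as the inverse of Yule's K: M1^2 / (M2 - M1).

    M1 = total number of tokens; M2 = sum of squared token frequencies.
    Lower I = more repetitive language (head CT corpus: 34, vs
    Reuters: 103 and I2B2 discharge summaries: 271).
    """
    counts = Counter(tokens)
    m1 = sum(counts.values())
    m2 = sum(f * f for f in counts.values())
    if m2 == m1:  # every token occurs exactly once: maximal diversity
        return float("inf")
    return (m1 * m1) / (m2 - m1)
```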