Zech John, Forde Jessica, Titano Joseph J, Kaji Deepak, Costa Anthony, Oermann Eric Karl
Department of Radiology, Icahn School of Medicine, New York, NY, USA.
Project Jupyter, 190 Doe Library, Berkeley, CA, USA.
Ann Transl Med. 2019 Jun;7(11):233. doi: 10.21037/atm.2018.08.11.
Errors in grammar, spelling, and usage are common in radiology reports. To automatically detect inappropriate word insertions, deletions, and substitutions in these reports, we proposed a neural sequence-to-sequence (seq2seq) model.
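As a concrete illustration, one plausible decision rule is to flag a sentence whenever the trained model's reconstruction differs from its input; the abstract does not specify the exact criterion, so the mismatch rule and helper names in this minimal Python sketch are assumptions:

    def flag_sentence(sentence, reconstruct):
        # Flag a sentence as possibly erroneous if the seq2seq model's
        # reconstruction differs from the input. `reconstruct` is a
        # hypothetical callable wrapping a trained seq2seq model.
        return reconstruct(sentence) != sentence

    # Stand-in "model" for illustration only: corrects one common typo,
    # so any sentence it changes is flagged.
    toy_model = lambda s: s.replace("hemorrage", "hemorrhage")
    print(flag_sentence("no acute intracranial hemorrage", toy_model))  # True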
Head CT and chest radiograph reports from Mount Sinai Hospital (MSH) (n=61,722 and 818,978, respectively), Mount Sinai Queens (MSQ) (n=30,145 and 194,309, respectively), and MIMIC-III (n=32,259 and 54,685, respectively) were converted into sentences. Insertions, substitutions, and deletions of words were randomly introduced. Seq2seq models were trained using corrupted sentences as input to predict the original uncorrupted sentences. Three models were trained: one using head CTs from MSH, one using chest radiographs from MSH, and one using head CTs from all three collections. Model performance was assessed across sites and modalities. A sample of original, uncorrupted sentences was manually reviewed for errors in syntax, usage, or spelling to estimate the algorithm's real-world proofreading performance.
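A minimal Python sketch of the corruption step described above; the abstract does not give the exact corruption rates or sampling rules, so the per-token probability p and the uniform choice over edit types are assumptions:

    import random

    def corrupt_sentence(tokens, vocab, p=0.1, rng=random):
        # Randomly insert, delete, or substitute words in a token list,
        # yielding (corrupted input, original target) training pairs.
        out = []
        for tok in tokens:
            if rng.random() < p:
                op = rng.choice(["insert", "delete", "substitute"])
                if op == "insert":
                    out.append(rng.choice(vocab))  # spurious word before tok
                    out.append(tok)
                elif op == "substitute":
                    out.append(rng.choice(vocab))
                # "delete": drop tok entirely
            else:
                out.append(tok)
        return out

    original = "no acute intracranial hemorrhage".split()
    vocab = ["no", "acute", "chronic", "hemorrhage", "infarct", "midline"]
    corrupted = corrupt_sentence(original, vocab, p=0.2)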
Seq2seq detected 90.3% and 88.2% of corrupted sentences with 97.7% and 98.8% specificity in same-site, same-modality test sets for head CTs and chest radiographs, respectively. Manual review of original, uncorrupted same-site, same-modality head CT sentences demonstrated a seq2seq positive predictive value (PPV) of 0.393 (157/400; 95% CI, 0.346-0.441) and a negative predictive value (NPV) of 0.986 (789/800; 95% CI, 0.976-0.992) for detecting sentences containing real-world errors, with an estimated sensitivity of 0.389 (95% CI, 0.267-0.542) and specificity of 0.986 (95% CI, 0.985-0.987) over n=86,211 uncorrupted training examples.
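The quoted PPV and NPV intervals are consistent with 95% Wilson score intervals for the stated counts; the choice of Wilson's method is an inference rather than something stated in the abstract, but it can be checked with a few lines of Python:

    from math import sqrt

    def wilson_ci(successes, n, z=1.96):
        # 95% Wilson score interval for a binomial proportion.
        p = successes / n
        denom = 1 + z * z / n
        center = (p + z * z / (2 * n)) / denom
        half = (z / denom) * sqrt(p * (1 - p) / n + z * z / (4 * n * n))
        return center - half, center + half

    print(wilson_ci(157, 400))  # PPV: ~(0.346, 0.441)
    print(wilson_ci(789, 800))  # NPV: ~(0.976, 0.992)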
Seq2seq models can be highly effective at detecting erroneous insertions, deletions, and substitutions of words in radiology reports. To achieve high performance, these models require site- and modality-specific training examples. Incorporating additional targeted training data could further improve performance in detecting real-world errors in reports.