Odisho Anobel Y, Park Briton, Altieri Nicholas, DeNero John, Cooperberg Matthew R, Carroll Peter R, Yu Bin
Department of Urology, UCSF Helen Diller Family Comprehensive Cancer Center, San Francisco, California, USA.
Department of Statistics, University of California, Berkeley, California, USA.
JAMIA Open. 2020 Oct 14;3(3):431-438. doi: 10.1093/jamiaopen/ooaa029. eCollection 2020 Oct.
Cancer is a leading cause of death, but much of the diagnostic information is stored as unstructured data in pathology reports. We aim to improve the uncertainty estimates of machine learning-based pathology parsers and to evaluate their performance in low-data settings.
Our data come from the Urologic Outcomes Database at UCSF, which includes 3232 annotated prostate cancer pathology reports from 2001 to 2018. We address 17 separate information extraction tasks involving a wide range of pathologic features. To handle the diverse range of fields, we required 2 statistical models: a document classification method for pathologic features with a small set of possible values, and a token extraction method for pathologic features with a large set of values. For each model, we used isotonic calibration to improve the model's estimates of its likelihood of being correct.
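As an illustration of the calibration step, the following is a minimal sketch of isotonic calibration using the pool-adjacent-violators algorithm. It is not the authors' implementation: the function names `fit_isotonic` and `calibrate`, the step-function lookup, and the toy data are all assumptions made for this example. The idea is to learn a non-decreasing map from a model's raw confidence to the observed fraction of correct predictions on held-out data.

```python
from bisect import bisect_left
from collections import defaultdict


def fit_isotonic(scores, correct):
    """Fit a non-decreasing step map from raw confidence to empirical accuracy.

    scores  : raw model confidences in [0, 1] on held-out examples
    correct : 1 if the prediction was right, 0 otherwise
    Returns (uppers, means): each block's upper score bound and its calibrated value.
    """
    # Aggregate tied scores first so each distinct score forms one starting block.
    agg = defaultdict(lambda: [0.0, 0])
    for s, y in zip(scores, correct):
        agg[s][0] += y
        agg[s][1] += 1

    blocks = []  # each block: [sum of labels, count, largest score in block]
    for s in sorted(agg):
        total, count = agg[s]
        blocks.append([total, count, s])
        # Pool adjacent violators: merge while the previous mean exceeds the current mean.
        while (len(blocks) > 1
               and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            t, c, upper = blocks.pop()
            blocks[-1][0] += t
            blocks[-1][1] += c
            blocks[-1][2] = upper

    uppers = [b[2] for b in blocks]
    means = [b[0] / b[1] for b in blocks]
    return uppers, means


def calibrate(model, score):
    """Map a raw confidence to its calibrated value (step function, clipped at the ends)."""
    uppers, means = model
    i = bisect_left(uppers, score)
    return means[min(i, len(means) - 1)]


# Hypothetical held-out data from an overconfident model.
raw = [0.6, 0.7, 0.8, 0.9, 0.9, 0.99]
hit = [0, 1, 0, 1, 0, 1]
model = fit_isotonic(raw, hit)
print(calibrate(model, 0.9))
```

In this toy data, predictions made with a raw confidence of 0.9 were right only half the time, so the fitted map lowers that confidence accordingly. Library implementations typically interpolate between blocks rather than using a pure step function.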
Our best document classifier method, a convolutional neural network, achieves a weighted F1 score of 0.97 averaged over 12 fields, and our best extraction method achieves an accuracy of 0.93 averaged over 5 fields. Performance saturates as dataset size increases, with as few as 128 data points. Furthermore, our document classifier methods have reliable uncertainty estimates, whereas our extraction-based methods do not. After isotonic calibration, however, expected calibration error drops below 0.03 for all extraction fields.
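For readers unfamiliar with the metric, expected calibration error (ECE) can be sketched as follows. This is a common formulation, not necessarily the exact one used in the study: the equal-width binning, the default of 10 bins, and the function name `expected_calibration_error` are assumptions made for this example. Predictions are grouped by confidence, and the gap between mean confidence and observed accuracy in each bin is averaged, weighted by bin size.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between confidence and accuracy over equal-width bins.

    confidences : predicted probabilities of being correct, each in [0, 1]
    correct     : 1 if the prediction was right, 0 otherwise
    n_bins      : number of equal-width confidence bins (assumed default: 10)
    """
    n = len(confidences)
    # Per bin: [sum of confidences, number correct, number of examples].
    bins = [[0.0, 0, 0] for _ in range(n_bins)]
    for c, y in zip(confidences, correct):
        idx = min(int(c * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[idx][0] += c
        bins[idx][1] += y
        bins[idx][2] += 1

    ece = 0.0
    for conf_sum, n_correct, count in bins:
        if count:
            gap = abs(n_correct / count - conf_sum / count)
            ece += (count / n) * gap
    return ece


# Hypothetical example: four predictions at 0.95 confidence, three of them correct.
print(expected_calibration_error([0.95, 0.95, 0.95, 0.95], [1, 1, 1, 0]))
```

In the hypothetical example the model claims 95% confidence but is right 75% of the time, giving an ECE of about 0.2. A value below 0.03, as reported for the calibrated extraction fields, means stated confidence and observed accuracy differ by less than 3 percentage points on average.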
We find that when applying machine learning to pathology parsing, large datasets may not always be needed, and that calibration methods can improve the reliability of uncertainty estimates.