Comparison of Artificial Intelligence Techniques to Evaluate Performance of a Classifier for Automatic Grading of Prostate Cancer From Digitized Histopathologic Images.

Author Affiliations

Department of Electrical and Computer Engineering, University of British Columbia, Vancouver, British Columbia, Canada.

Department of Urologic Sciences, University of British Columbia, Vancouver, British Columbia, Canada.

Publication Information

JAMA Netw Open. 2019 Mar 1;2(3):e190442. doi: 10.1001/jamanetworkopen.2019.0442.

Abstract

IMPORTANCE

Proper evaluation of the performance of artificial intelligence techniques in the analysis of digitized medical images is paramount for the adoption of such techniques by the medical community and regulatory agencies.

OBJECTIVES

To compare several cross-validation (CV) approaches to evaluate the performance of a classifier for automatic grading of prostate cancer in digitized histopathologic images and compare the performance of the classifier when trained using data from 1 expert and multiple experts.

DESIGN, SETTING, AND PARTICIPANTS

This quality improvement study used tissue microarray data (333 cores) from 231 patients who underwent radical prostatectomy at the Vancouver General Hospital between June 27, 1997, and June 7, 2011. Digitized images of tissue cores were annotated by 6 pathologists for 4 classes (benign and Gleason grades 3, 4, and 5) between December 12, 2016, and October 5, 2017. Nonoverlapping patches of 192 µm² were extracted from these images. A deep learning classifier based on convolutional neural networks was trained to predict 1 of the 4 class labels for each image patch. The classification performance was evaluated in 20-fold leave-patches-out CV, leave-cores-out CV, and leave-patients-out CV. The analysis was performed between November 15, 2018, and January 1, 2019.
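The difference between the three CV schemes comes down to which unit is kept together when folds are formed. The following is a minimal Python sketch on toy data, not the authors' code: the function names, the dictionary keys (`patient`, `core`, `patch`), the round-robin fold assignment, and the toy cohort of 40 patients are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def patch_level_folds(patches, k, seed=0):
    """Leave-patches-out: shuffle patches and deal them into k folds,
    ignoring which core or patient each patch came from."""
    idx = list(range(len(patches)))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def group_level_folds(patches, k, key, seed=0):
    """Leave-groups-out: every patch sharing patches[i][key] lands in the
    same fold. key="core" gives leave-cores-out, key="patient" gives
    leave-patients-out."""
    groups = defaultdict(list)
    for i, patch in enumerate(patches):
        groups[patch[key]].append(i)
    group_ids = sorted(groups)
    random.Random(seed).shuffle(group_ids)
    folds = [[] for _ in range(k)]
    for j, g in enumerate(group_ids):
        folds[j % k].extend(groups[g])  # round-robin over shuffled groups
    return folds


def folds_with_shared_patients(patches, folds):
    """Count folds whose held-out patients also appear in the training folds."""
    count = 0
    for f, test_idx in enumerate(folds):
        test_patients = {patches[i]["patient"] for i in test_idx}
        train_patients = {patches[i]["patient"]
                          for g, fold in enumerate(folds) if g != f
                          for i in fold}
        if test_patients & train_patients:
            count += 1
    return count


# Toy data (hypothetical): 40 patients, 1 or 2 cores each, 10 patches per core.
patches = [{"patient": p, "core": (p, c), "patch": n}
           for p in range(40) for c in range(1 + p % 2) for n in range(10)]

K = 20
patch_folds = patch_level_folds(patches, K)
core_folds = group_level_folds(patches, K, key="core")
patient_folds = group_level_folds(patches, K, key="patient")

for name, folds in [("patches", patch_folds), ("cores", core_folds),
                    ("patients", patient_folds)]:
    print(name, folds_with_shared_patients(patches, folds))
```

Only the patient-level split guarantees that no patient contributes patches to both the training and the held-out side of a fold. The other two schemes can place tissue from one patient on both sides, which is the kind of leakage the comparison in this study is designed to expose.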

MAIN OUTCOMES AND MEASURES

The classifier performance was evaluated by its accuracy, sensitivity, and specificity in detection of cancer (benign vs cancer) and in low-grade vs high-grade differentiation (Gleason grade 3 vs grades 4-5). The statistical significance analysis was performed using the McNemar test. The agreement level between pathologists and the classifier was quantified using a quadratic-weighted κ statistic.
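The measures named above can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' analysis code: the function names are hypothetical, class labels are assumed to be integers 0 to 3 (benign, Gleason 3, 4, 5), cancer is assumed to be the positive class coded as 1, and the McNemar statistic is shown in its continuity-corrected chi-square form, which the abstract does not specify.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, sensitivity, specificity with 1 = positive (e.g. cancer).
    Assumes both classes occur in y_true."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return (tp + tn) / len(y_true), tp / (tp + fn), tn / (tn + fp)


def mcnemar_chi2(y_true, pred_a, pred_b):
    """Continuity-corrected McNemar statistic on the discordant pairs:
    b = A right and B wrong, c = A wrong and B right."""
    b = sum(1 for t, a, p in zip(y_true, pred_a, pred_b) if a == t and p != t)
    c = sum(1 for t, a, p in zip(y_true, pred_a, pred_b) if a != t and p == t)
    if b + c == 0:
        return 0.0  # the two classifiers never disagree in correctness
    return (abs(b - c) - 1) ** 2 / (b + c)


def quadratic_weighted_kappa(rater_a, rater_b, n_classes):
    """Cohen's kappa with disagreement weights w_ij = (i - j)^2 / (n - 1)^2."""
    n = len(rater_a)
    observed = [[0] * n_classes for _ in range(n_classes)]
    for a, b in zip(rater_a, rater_b):
        observed[a][b] += 1
    row = [sum(r) for r in observed]
    col = [sum(observed[i][j] for i in range(n_classes))
           for j in range(n_classes)]
    num = den = 0.0
    for i in range(n_classes):
        for j in range(n_classes):
            w = (i - j) ** 2 / (n_classes - 1) ** 2
            num += w * observed[i][j]          # observed weighted disagreement
            den += w * row[i] * col[j] / n     # disagreement expected by chance
    if den == 0:
        return 1.0  # both raters used one and the same class throughout
    return 1.0 - num / den


# Small made-up example: two raters grading 8 patches into 4 ordered classes.
pathologist = [0, 0, 1, 1, 2, 2, 3, 3]
classifier = [0, 1, 1, 1, 2, 3, 3, 2]
print(quadratic_weighted_kappa(pathologist, classifier, n_classes=4))
```

The quadratic weights penalize a benign versus grade 5 disagreement far more than a grade 3 versus grade 4 disagreement, which suits ordered labels such as Gleason grades.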

RESULTS

On 333 tissue microarray cores from 231 participants with prostate cancer (mean [SD] age, 63.2 [6.3] years), 20-fold leave-patches-out CV resulted in mean (SD) accuracy of 97.8% (1.2%), sensitivity of 98.5% (1.0%), and specificity of 97.5% (1.2%) for classifying benign patches vs cancerous patches. By contrast, 20-fold leave-patients-out CV resulted in mean (SD) accuracy of 85.8% (4.3%), sensitivity of 86.3% (4.1%), and specificity of 85.5% (7.2%). Similarly, 20-fold leave-cores-out CV resulted in mean (SD) accuracy of 86.7% (3.7%), sensitivity of 87.2% (4.0%), and specificity of 87.7% (5.5%). Results of McNemar tests showed that the leave-patches-out CV accuracy, sensitivity, and specificity were significantly higher than those for both leave-patients-out CV and leave-cores-out CV. Similar results were observed for classifying low-grade cancer vs high-grade cancer. When trained on a single expert, the overall agreement in grading between pathologists and the classifier ranged from 0.38 to 0.58; when trained using the majority vote among all experts, it was 0.60.

CONCLUSIONS AND RELEVANCE

Results of this study suggest that in prostate cancer classification from histopathologic images, patch-wise CV and single-expert training and evaluation may lead to a biased estimation of the classifier's performance. To allow reproducibility and facilitate comparison between automatic classification methods, studies in the field should evaluate their performance using patient-based CV and multiexpert data. Some of these conclusions may be generalizable to other histopathologic applications and to other applications of machine learning in medicine.
