Department of Anatomic Pathology, Kasr Alainy Faculty of Medicine, Cairo University, Giza, Egypt.
Department of Computer Science, Faculty of Graduate Studies for Statistical Research, Cairo University, Giza, Egypt.
Acta Cytol. 2024;68(2):160-170. doi: 10.1159/000538465. Epub 2024 Mar 24.
Artificial intelligence (AI) algorithms have seen little application in serous fluid cytology because standardized, publicly available datasets are lacking. Here, we develop a novel public serous effusion cytology dataset and apply AI algorithms to it to test its diagnostic utility and safety in clinical practice.
The work is divided into three phases. Phase 1 entails building the dataset based on the multitiered, evidence-based classification system proposed by the International System (TIS) of serous fluid cytology, along with ground-truth tissue diagnosis for malignancy. To ensure that future AI research on this dataset yields reliable results, we carefully consider every step of preparation and staining from a real-world cytopathology perspective. In phase 2, we pay special attention to the image acquisition pipeline to ensure image integrity. We then apply transfer learning, using the convolutional layers of the VGG16 deep learning model for feature extraction. Finally, in phase 3, we apply a random forest classifier to the constructed dataset.
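The feature-extraction and classification steps of phases 2 and 3 can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the abstract does not state the input size, pooling strategy, pretrained weights, or forest hyperparameters, so the 224x224 input, ImageNet weights, global average pooling, and 100 trees below are all assumptions, and the function names are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier


def extract_vgg16_features(images, weights="imagenet"):
    """Return one feature vector per image from VGG16's convolutional base.

    `images` is an array of shape (n, 224, 224, 3) holding RGB values in 0..255.
    Input size, weights, and pooling are assumptions, not taken from the paper.
    """
    # Imported lazily so the classifier step can be used without TensorFlow.
    from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input

    # include_top=False drops the dense layers and keeps only the convolutional
    # layers; pooling="avg" reduces the last feature map to a 512-d vector.
    base = VGG16(weights=weights, include_top=False, pooling="avg")
    batch = preprocess_input(np.array(images, dtype="float32"))
    return base.predict(batch, verbose=0)


def train_forest(features, labels, n_trees=100, seed=0):
    """Fit a random forest on precomputed feature vectors (assumed settings)."""
    forest = RandomForestClassifier(n_estimators=n_trees, random_state=seed)
    forest.fit(features, labels)
    return forest
```

In use, one would call `extract_vgg16_features` once on the image set, then pass the resulting matrix and the four-category labels to `train_forest`. Keeping the two steps separate means the (slow) convolutional pass is not repeated when the classifier settings change.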
The dataset comprises 3,731 images distributed among the four TIS diagnostic categories. The model achieves 74% accuracy in this multiclass classification problem. Using a one-versus-all classifier, the fall-out (false-positive rate) for images misclassified as negative for malignancy despite belonging to a higher-risk diagnostic category is 0.13. Most of these misclassified images (77%) belong to the atypia of undetermined significance category, in concordance with real-life statistics.
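As an illustration of the metric reported above, fall-out for one class in a one-versus-all reading can be computed from a multiclass confusion matrix as false positives divided by all images that truly belong to other classes. The sketch below assumes this standard definition (the abstract does not spell out the formula), and the counts and class order are made up for illustration; they are not the paper's results.

```python
def fall_out(confusion, positive):
    """False-positive rate of class `positive` in a one-versus-all reading.

    confusion[i][j] counts images whose true class is i and predicted class is j.
    """
    n = len(confusion)
    # Images of other classes that were predicted as `positive`.
    false_pos = sum(confusion[i][positive] for i in range(n) if i != positive)
    # All images whose true class is not `positive`.
    actual_neg = sum(sum(confusion[i]) for i in range(n) if i != positive)
    return false_pos / actual_neg if actual_neg else 0.0


# Hypothetical counts, NOT the paper's data.
# Row/column 0 is assumed to be "negative for malignancy".
toy = [
    [90, 5, 3, 2],
    [10, 30, 5, 5],
    [2, 4, 40, 4],
    [1, 2, 7, 90],
]
rate = fall_out(toy, positive=0)
```

With the negative-for-malignancy class as `positive`, a low value means few higher-risk images are wrongly reported as negative, which is the safety property the abstract emphasizes.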
This is the first and largest publicly available serous fluid cytology dataset based on a standardized diagnostic system. It is also the first dataset to include various types of effusions and pericardial fluid specimens, and the first to include the diagnostically challenging atypical categories. AI algorithms applied to this novel dataset show reliable results that can be incorporated into actual clinical practice with minimal risk of missing a diagnosis of malignancy. This work provides a foundation for researchers to develop and test further AI algorithms for the diagnosis of serous effusions.