Institute of Dentistry, School of Medicine, Medical Sciences and Nutrition, University of Aberdeen, Aberdeen, AB25 2ZR, United Kingdom; Center for Research in Oral Cancer, Department of Basic Sciences, Faculty of Dental Sciences, University of Peradeniya, Kandy, 20400, Sri Lanka.
Department of Computer Engineering, Faculty of Engineering, University of Peradeniya, Kandy, 20400, Sri Lanka.
Oral Oncol. 2024 Sep;156:106946. doi: 10.1016/j.oraloncology.2024.106946. Epub 2024 Jul 13.
This study addresses a critical gap: the lack of publicly accessible oral cavity image datasets for developing machine learning (ML) and artificial intelligence (AI) technologies for the diagnosis and prognosis of oral cancer (OCA) and oral potentially malignant disorders (OPMD), with a particular focus on Asia, where prevalence is high and diagnosis is often delayed.
Following ethical approval and written informed consent, images of the oral cavity were captured with mobile phone cameras, and clinical data were extracted from the hospital records of patients attending the Dental Teaching Hospital, Peradeniya, Sri Lanka. After data management and hosting, clinicians categorized and annotated the images using a custom software tool developed by the research team.
The dataset comprises 3000 high-quality, anonymized images obtained from 714 patients, classified into four categories: healthy, benign, OPMD, and OCA. Images were annotated with polygon-shaped oral cavity and lesion boundaries. Each image is accompanied by patient metadata, including age, sex, diagnosis, and risk factor profiles such as smoking, alcohol, and betel chewing habits.
Researchers can use the annotated images, provided in the COCO format, together with the patient metadata to support ML and AI algorithm development.
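As an illustration of how COCO-format polygon annotations of this kind might be read, the following is a minimal sketch using only the Python standard library. The file name, category names, and coordinates in the sample are hypothetical and are not taken from the dataset; the sketch also assumes every segmentation is stored as a polygon (a flat coordinate list), not as a run-length-encoded mask.

```python
from collections import defaultdict


def index_coco_polygons(coco: dict) -> dict:
    """Group polygon annotations by image file name.

    Returns {file_name: [(category_name, [(x, y), ...]), ...]}.
    """
    file_names = {img["id"]: img["file_name"] for img in coco["images"]}
    category_names = {cat["id"]: cat["name"] for cat in coco["categories"]}

    by_image = defaultdict(list)
    for ann in coco["annotations"]:
        # COCO stores each polygon as a flat list [x1, y1, x2, y2, ...].
        for flat in ann["segmentation"]:
            points = list(zip(flat[0::2], flat[1::2]))
            by_image[file_names[ann["image_id"]]].append(
                (category_names[ann["category_id"]], points)
            )
    return dict(by_image)


# Hypothetical in-memory sample; a real file would be loaded with json.load().
sample = {
    "images": [{"id": 1, "file_name": "example_001.jpg"}],
    "categories": [
        {"id": 1, "name": "oral_cavity"},  # assumed category name
        {"id": 2, "name": "lesion"},       # assumed category name
    ],
    "annotations": [
        {"id": 10, "image_id": 1, "category_id": 1,
         "segmentation": [[0, 0, 100, 0, 100, 80, 0, 80]]},
        {"id": 11, "image_id": 1, "category_id": 2,
         "segmentation": [[20, 20, 40, 20, 30, 40]]},
    ],
}

result = index_coco_polygons(sample)
for label, polygon in result["example_001.jpg"]:
    print(label, len(polygon), "vertices")
```

Patient metadata such as age, sex, and risk factors would need to be joined to the images separately; how the dataset links the two is not described here, so that step is left out.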