Nakayama Luis Filipe, Restrepo David, Matos João, Ribeiro Lucas Zago, Malerbi Fernando Korn, Celi Leo Anthony, Regatieri Caio Saito
Department of Ophthalmology, São Paulo Federal University, São Paulo, São Paulo, Brazil.
Laboratory for Computational Physiology, Massachusetts Institute of Technology, Cambridge, Massachusetts, United States of America.
PLOS Digit Health. 2024 Jul 11;3(7):e0000454. doi: 10.1371/journal.pdig.0000454. eCollection 2024 Jul.
The Brazilian Multilabel Ophthalmological Dataset (BRSET) addresses the scarcity of publicly available ophthalmological datasets in Latin America. BRSET comprises 16,266 color fundus retinal photos from 8,524 Brazilian patients. It aims to improve data representativeness and to serve as a research and teaching tool. It also contains sociodemographic information, enabling investigations into differential model performance across demographic groups.
Demographic and medical information was collected from the electronic records of three São Paulo outpatient centers, including nationality, age, sex, clinical history, insulin use, and duration of diabetes diagnosis. A retinal specialist labeled images for anatomical features (optic disc, blood vessels, macula), quality control (focus, illumination, image field, artifacts), and pathologies (e.g., diabetic retinopathy). Diabetic retinopathy was graded using the International Clinical Diabetic Retinopathy (ICDR) scale and the Scottish Diabetic Retinopathy Grading scheme. Validation used a ConvNext model trained for 50 epochs with a weighted cross-entropy loss to avoid overfitting, with 70% training (20% validation) and 30% testing subsets. Performance metrics included the area under the receiver operating characteristic curve (AUC) and the macro F1-score. Saliency maps were calculated for interpretability.
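The weighted cross-entropy loss mentioned above can be illustrated with a minimal sketch. The abstract does not state how the class weights were chosen; the inverse-frequency weighting below, the function names, and the toy labels and probabilities are all assumptions made for illustration, not the authors' implementation.

```python
import math
from collections import Counter

def inverse_frequency_weights(labels):
    """Assumed scheme: weight = n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * counts[c]) for c in counts}

def weighted_cross_entropy(probs, labels, weights):
    """Weighted mean of -log p(true class), normalised by the total weight."""
    total = sum(weights[y] * -math.log(p[y]) for p, y in zip(probs, labels))
    return total / sum(weights[y] for y in labels)

# Toy imbalanced example: four "normal" images (0) and one "disease" image (1).
labels = [0, 0, 0, 0, 1]
probs = [[0.9, 0.1]] * 4 + [[0.6, 0.4]]   # hypothetical model outputs

weights = inverse_frequency_weights(labels)
loss = weighted_cross_entropy(probs, labels, weights)
unweighted = sum(-math.log(p[y]) for p, y in zip(probs, labels)) / len(labels)
```

In this toy case the rare class receives four times the weight of the common class, so the poorly predicted minority example contributes more to the loss than it would in the unweighted mean.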
BRSET comprises 65.1% Canon CR2 and 34.9% Nikon NF5050 images. Of the patients, 61.8% are female, and the average age is 57.6 (± 18.26) years. Diabetic retinopathy affected 15.8% of patients, across a spectrum of disease severity. Anatomically, 20.2% showed abnormal optic discs, 4.9% abnormal blood vessels, and 28.8% abnormal macula. A ConvNext V2 model was trained and evaluated on BRSET in four prediction tasks: "binary diabetic retinopathy diagnosis (Normal vs Diabetic Retinopathy)" (AUC: 97, F1: 89); "3 class diabetic retinopathy diagnosis (Normal, Proliferative, Non-Proliferative)" (AUC: 97, F1: 82); "diabetes diagnosis" (AUC: 91, F1: 83); "sex classification" (AUC: 87, F1: 70).
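For readers unfamiliar with the two reported metrics, the following sketch shows how a macro F1-score and a binary AUC can be computed from first principles. The helper names, the toy labels and scores, and the 0.5 decision threshold are assumptions for illustration and are unrelated to the BRSET results.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

def binary_auc(y_true, scores):
    """Probability that a random positive outranks a random negative
    (ties count as one half); requires both classes to be present."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical labels and predicted probabilities of the positive class.
y_true = [0, 0, 0, 1, 1]
y_score = [0.10, 0.40, 0.35, 0.80, 0.30]
y_pred = [int(s >= 0.5) for s in y_score]   # assumed 0.5 threshold

f1 = macro_f1(y_true, y_pred)
auc = binary_auc(y_true, y_score)
```

Because macro averaging gives every class equal weight regardless of its size, a model that neglects a rare class is penalised more heavily than it would be under plain accuracy.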
BRSET is the first multilabel ophthalmological dataset in Brazil and Latin America. It provides an opportunity for investigating model biases by evaluating performance across demographic groups. The model performance on the four prediction tasks demonstrates the value of the dataset for external validation and for teaching medical computer vision to learners in Latin America using locally relevant data sources.
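One simple way to probe the kind of bias described above is to compute a metric separately for each demographic group and compare the values. The sketch below uses sensitivity (true positive rate) split by sex; the choice of metric, the function name, and the toy data are assumptions for illustration and do not come from the paper.

```python
from collections import defaultdict

def recall_by_group(y_true, y_pred, groups):
    """Sensitivity (true positive rate) computed separately per group."""
    hits, positives = defaultdict(int), defaultdict(int)
    for t, p, g in zip(y_true, y_pred, groups):
        if t == 1:
            positives[g] += 1
            hits[g] += int(p == 1)
    return {g: hits[g] / positives[g] for g in positives}

# Hypothetical labels, predictions, and a sex attribute for eight patients.
y_true = [1, 1, 1, 0, 1, 1, 0, 1]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0]
sex = ["F", "F", "F", "F", "M", "M", "M", "M"]

per_group = recall_by_group(y_true, y_pred, sex)
gap = max(per_group.values()) - min(per_group.values())
```

A large gap between groups would suggest that the model misses disease more often in one group than in another, which is a signal to investigate further rather than a conclusion in itself.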