Chen Dake, Han Ying, Duncan Jacque, Jia Lin, Shan Jing
Department of Ophthalmology, University of California, San Francisco, San Francisco, California.
Digillect LLC, San Francisco, California.
Ophthalmol Sci. 2024 Apr 14;4(5):100531. doi: 10.1016/j.xops.2024.100531. eCollection 2024 Sep-Oct.
Training data fuel and shape the development of artificial intelligence (AI) models. Intensive data requirements are a major bottleneck limiting the success of AI tools in sectors with inherently scarce data. In health care, training data are difficult to curate, triggering growing concerns that the current lack of access to health care by underprivileged social groups will translate into future bias in health care AIs. In this report, we developed an autoencoder to grow and enhance inherently scarce datasets and thereby alleviate our dependence on big data.
Computational study with open-source data.
The data were obtained from 6 open-source datasets comprising patients aged 40-80 years in Singapore, China, India, and Spain.
The reported framework generates synthetic images based on real-world patient imaging data. As a test case, we used the autoencoder to expand publicly available training sets of optic disc photos and evaluated the ability of the resultant datasets to train AI models to detect glaucomatous optic neuropathy.
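As a rough illustration of the general idea of expanding a training set with autoencoder output, the following toy sketch trains a linear autoencoder with one latent unit on 2-D points and decodes noise-perturbed latent codes into synthetic samples. Everything here is an assumption made for illustration: the abstract does not describe the architecture or the sampling procedure, and the names `train`, `synthesize`, and the latent-noise scheme are hypothetical stand-ins, not the reported framework.

```python
import random

def encode(enc, x):
    # Project the input onto a single latent value.
    return sum(e * xi for e, xi in zip(enc, x))

def decode(dec, z):
    # Map the latent value back to input space.
    return [z * d for d in dec]

def mse(enc, dec, data):
    # Mean squared reconstruction error over the dataset.
    total = 0.0
    for x in data:
        x_hat = decode(dec, encode(enc, x))
        total += sum((a - b) ** 2 for a, b in zip(x_hat, x))
    return total / len(data)

def train(data, epochs=500, lr=0.05, seed=0):
    # Per-sample gradient descent on 0.5 * ||decode(encode(x)) - x||^2.
    rng = random.Random(seed)
    dim = len(data[0])
    enc = [rng.uniform(-0.5, 0.5) for _ in range(dim)]
    dec = [rng.uniform(-0.5, 0.5) for _ in range(dim)]
    for _ in range(epochs):
        for x in data:
            z = encode(enc, x)
            err = [z * d - xi for d, xi in zip(dec, x)]
            grad_z = sum(er * d for er, d in zip(err, dec))
            dec = [d - lr * er * z for d, er in zip(dec, err)]
            enc = [e - lr * grad_z * xi for e, xi in zip(enc, x)]
    return enc, dec

def synthesize(enc, dec, data, n, noise=0.1, seed=1):
    # Assumed sampling scheme: perturb the latent code of a real sample.
    rng = random.Random(seed)
    out = []
    for _ in range(n):
        z = encode(enc, rng.choice(data)) + rng.gauss(0.0, noise)
        out.append(decode(dec, z))
    return out

# Toy "real" data: points on the line y = 0.5 * x.
real = [(t / 10, 0.5 * t / 10) for t in range(-10, 11)]
enc, dec = train(real)
synthetic = synthesize(enc, dec, real, n=40)
augmented = list(real) + [tuple(s) for s in synthetic]
```

In this toy setting the synthetic points fall on the same line as the real ones, which is the property an augmentation scheme needs: new samples that stay within the structure of the real data. An image model would replace the linear maps with a deep encoder and decoder.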
The area under the receiver operating characteristic curve (AUC) was used to evaluate the performance of the glaucoma detector. A higher AUC indicates better detection performance.
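For readers unfamiliar with the metric, the AUC equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half. The minimal sketch below computes it by direct pairwise comparison. The function name `roc_auc` and the example labels and scores are illustrative assumptions, not data or code from the study.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))

# Hypothetical detector outputs: 1 = glaucoma, 0 = healthy.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(labels, scores)  # 3 of 4 pairs are ranked correctly
```

An AUC of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation of the two classes.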
Results show that enhancing datasets with synthetic images generated by the autoencoder led to superior training sets that improved the performance of AI models.
Our findings here help address the increasingly untenable data volume and quality requirements for AI model development and have implications beyond health care, toward empowering AI adoption for all similarly data-challenged fields.
The authors have no proprietary or commercial interest in any materials discussed in this article.