Klein Jonathan, Waller Rebekah, Pirk Sören, Pałubicki Wojtek, Tester Mark, Michels Dominik L
Computational Sciences Group, King Abdullah University of Science and Technology (KAUST), Thuwal, Saudi Arabia.
Center for Desert Agriculture, King Abdullah University of Science and Technology (KAUST), Thuwal, Saudi Arabia.
Front Plant Sci. 2024 Sep 16;15:1360113. doi: 10.3389/fpls.2024.1360113. eCollection 2024.
The rise of artificial intelligence (AI), and in particular modern machine learning (ML) algorithms, during the last decade has been met with great interest in the agricultural industry. While these algorithms are undisputedly powerful, their main drawback remains the need for sufficient and diverse training data. The collection of real datasets and their annotation are the main cost drivers of ML development, and while promising results on synthetically generated training data have been shown, generating such data is not without difficulties of its own. In this paper, we present a development model for the iterative, cost-efficient generation of synthetic training data. Its application is demonstrated by developing a low-cost early disease detector for tomato plants using synthetic training data. A neural classifier is trained exclusively on synthetic images, whose generation process is iteratively refined to obtain optimal performance. In contrast to other approaches that rely on a human assessment of similarity between real and synthetic data, we introduce a structured, quantitative approach. Our evaluation shows superior generalization results compared to using non-task-specific real training data and a higher cost efficiency of development compared to traditional synthetic training data. We believe that our approach will help to reduce the cost of synthetic data generation in future applications.