Breaking the barrier of human-annotated training data for machine learning-aided plant research using aerial imagery.

Authors

Varela Sebastian, Zheng Xuying, Njuguna Joyce, Sacks Erik, Allen Dylan, Ruhter Jeremy, Leakey Andrew D B

Affiliations

Center for Advanced Bioenergy and Bioproducts Innovation, University of Illinois at Urbana Champaign, Urbana, IL 61801, USA.

Independent Researcher, Canelones 15800, Uruguay.

Publication information

Plant Physiol. 2025 Mar 28;197(4). doi: 10.1093/plphys/kiaf132.

DOI:10.1093/plphys/kiaf132

PMID:40265604

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12015685/

Abstract

Machine learning (ML) can accelerate biological research. However, the adoption of such tools to facilitate phenotyping based on sensor data has been limited by (i) the need for a large amount of human-annotated training data for each context in which the tool is used and (ii) phenotypes varying across contexts defined in terms of genetics and environment. This is a major bottleneck because acquiring training data is generally costly and time-consuming. This study demonstrates how a ML approach can address these challenges by minimizing the amount of human supervision needed for tool building. A case study was performed to compare ML approaches that examine images collected by an uncrewed aerial vehicle to determine the presence/absence of panicles (i.e. "heading") across thousands of field plots containing genetically diverse breeding populations of 2 Miscanthus species. Automated analysis of aerial imagery enabled the identification of heading approximately 9 times faster than in-field visual inspection by humans. Leveraging an Efficiently Supervised Generative Adversarial Network (ESGAN) learning strategy reduced the requirement for human-annotated data by 1 to 2 orders of magnitude compared to traditional, fully supervised learning approaches. The ESGAN model learned the salient features of the data set by using thousands of unlabeled images to inform the discriminative ability of a classifier so that it required minimal human-labeled training data. This method can accelerate the phenotyping of heading date as a measure of flowering time in Miscanthus across diverse contexts (e.g. in multistate trials) and opens avenues to promote the broad adoption of ML tools.
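The abstract does not spell out how unlabeled images inform the classifier, so the following is only an illustrative sketch. It assumes a common semi-supervised GAN formulation (Salimans et al., 2016), in which one discriminator outputs K class logits (here K = 2: heading present or absent) and derives a real-versus-generated probability from those same logits. All function names are hypothetical, and this is not presented as the authors' ESGAN implementation.

```python
import math

# Assumption: K = 2 class logits per image (heading absent / heading present).
# The real-vs-generated probability is derived from the same logits:
#   D(x) = Z / (Z + 1), with Z = sum(exp(logits)), i.e. sigmoid(logsumexp(logits)).

def logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def prob_real(logits):
    """Probability that an image is real rather than generated."""
    lse = logsumexp(logits)
    return math.exp(-softplus(-lse))  # sigmoid(lse)

def supervised_loss(logits, label):
    """Cross-entropy for one human-labeled image."""
    return logsumexp(logits) - logits[label]

def unlabeled_real_loss(logits):
    """-log D(x) for one real image without a label."""
    return softplus(-logsumexp(logits))

def generated_loss(logits):
    """-log(1 - D(x)) for one generator output."""
    return softplus(logsumexp(logits))

def mean(xs):
    xs = list(xs)
    return sum(xs) / len(xs)

def discriminator_loss(labeled, unlabeled, generated):
    """Sum of the three mean loss terms.

    labeled:   list of (logits, label) pairs (the small annotated set)
    unlabeled: list of logits for real, unannotated images
    generated: list of logits for generator outputs
    """
    return (
        mean(supervised_loss(lg, y) for lg, y in labeled)
        + mean(unlabeled_real_loss(lg) for lg in unlabeled)
        + mean(generated_loss(lg) for lg in generated)
    )
```

In this sketch only the first term needs human annotation. The other two terms shape the shared features using unlabeled and generated images, which is one way a classifier could get by with few labels. For example, with both class logits at zero, Z = 2 and `prob_real([0.0, 0.0])` evaluates to 2/3.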
