Zhang Yuanxin, Lin Sijie, Xiong Yaxin, Li Nan, Zhong Lijin, Ding Longzhen, Hu Qing
State Key Laboratory of Soil Pollution Control and Safety, Southern University of Science and Technology, Shenzhen, 518055, China.
School of Environmental Science and Engineering, Southern University of Science and Technology, Shenzhen, 518055, China.
Environ Sci Ecotechnol. 2025 Jul 28;27:100608. doi: 10.1016/j.ese.2025.100608. eCollection 2025 Sep.
Large language models (LLMs) are revolutionizing specialized fields by enabling advanced reasoning and data synthesis. Environmental science, however, poses unique hurdles due to its interdisciplinary scope, specialized jargon, and heterogeneous data spanning climate dynamics to ecosystem management. Despite progress in subdomains such as hydrology and climate modeling, no integrated framework exists to generate high-quality, domain-specific training data or to evaluate LLM performance across the discipline. Here we introduce a unified pipeline to address this gap. It comprises EnvInstruct, a multi-agent system for prompt generation; ChatEnv, a balanced 100-million-token instruction dataset spanning five core themes (climate change, ecosystems, water resources, soil management, and renewable energy); and EnvBench, a 4998-item benchmark assessing analysis, reasoning, calculation, and description tasks. Applying this pipeline, we fine-tune an 8-billion-parameter model, EnvGPT, which achieves 92.06 ± 1.85% accuracy on the independent EnviroExam benchmark, surpassing the parameter-matched LLaMA-3.1-8B baseline by ∼8 percentage points and rivaling the closed-source GPT-4o-mini and the 9-fold larger Qwen2.5-72B. On EnvBench, EnvGPT earns the top LLM-assigned scores for relevance (4.87 ± 0.11), factuality (4.70 ± 0.15), completeness (4.38 ± 0.19), and style (4.85 ± 0.10), outperforming baselines in every category. This study shows how targeted supervised fine-tuning on curated domain data can propel compact LLMs to state-of-the-art levels, bridging gaps in environmental applications. By openly releasing EnvGPT, ChatEnv, and EnvBench, our work establishes a reproducible foundation for accelerating LLM adoption in environmental research, policy, and practice, with potential extensions to multimodal and real-time tools.
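The EnvBench scores above are reported as mean ± deviation across LLM-assigned rubric ratings. The abstract does not specify the aggregation procedure, so the following is a minimal, stdlib-only sketch of how such per-dimension summaries are conventionally computed; the dimension names match the paper, but the example scores and the `summarize` helper are illustrative assumptions, not the authors' code.

```python
from statistics import mean, stdev

# Hypothetical judge ratings on a 1-5 scale for one model, grouped by the
# four EnvBench rubric dimensions. In a real evaluation these would be
# produced by an LLM judge over benchmark items, not hard-coded.
scores = {
    "relevance":    [5, 5, 5, 4, 5],
    "factuality":   [5, 4, 5, 5, 4],
    "completeness": [4, 4, 5, 4, 5],
    "style":        [5, 5, 4, 5, 5],
}

def summarize(values):
    """Return (mean, sample standard deviation) for a list of ratings."""
    return mean(values), stdev(values)

for dim, vals in scores.items():
    m, s = summarize(vals)
    print(f"{dim}: {m:.2f} \u00b1 {s:.2f}")
```

Reporting the sample standard deviation alongside the mean, as sketched here, is one common way to obtain figures in the "4.87 ± 0.11" form; the paper may instead average over repeated judge runs or bootstrap resamples.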