Department of Ecology & Evolutionary Biology, University of Toronto, Toronto, Ontario, Canada.
School of Biodiversity, One Health & Veterinary Medicine, University of Glasgow, Glasgow, UK.
Proc Biol Sci. 2024 Jul;291(2027):20240423. doi: 10.1098/rspb.2024.0423. Epub 2024 Jul 31.
In ecology and evolutionary biology, the synthesis and modelling of data from published literature are commonly used to generate insights and test theories across systems. However, the tasks of searching, screening, and extracting data from literature are often arduous. Researchers may manually process hundreds to thousands of articles for systematic reviews, meta-analyses, and compiling synthetic datasets. As relevant articles expand to tens or hundreds of thousands, computer-based approaches can increase the efficiency, transparency and reproducibility of literature-based research. Methods available for text mining are rapidly changing owing to developments in machine learning-based language models. We review the growing landscape of approaches, mapping them onto three broad paradigms (frequency-based approaches, traditional Natural Language Processing and deep learning-based language models). This serves as an entry point to learn foundational and cutting-edge concepts, vocabularies, and methods to foster integration of these tools into ecological and evolutionary research. We cover approaches for modelling ecological texts, generating training data, developing custom models and interacting with large language models, and we discuss challenges and possible solutions to implementing these methods in ecology and evolution.