The changing landscape of text mining: a review of approaches for ecology and evolution.

Affiliations

Department of Ecology & Evolutionary Biology, University of Toronto, Toronto, Ontario, Canada.

School of Biodiversity, One Health & Veterinary Medicine, University of Glasgow, Glasgow, UK.

Publication information

Proc Biol Sci. 2024 Jul;291(2027):20240423. doi: 10.1098/rspb.2024.0423. Epub 2024 Jul 31.

DOI:10.1098/rspb.2024.0423

PMID:39082244

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11289731/

Abstract

In ecology and evolutionary biology, the synthesis and modelling of data from published literature are commonly used to generate insights and test theories across systems. However, the tasks of searching, screening, and extracting data from literature are often arduous. Researchers may manually process hundreds to thousands of articles for systematic reviews, meta-analyses, and compiling synthetic datasets. As relevant articles expand to tens or hundreds of thousands, computer-based approaches can increase the efficiency, transparency and reproducibility of literature-based research. Methods available for text mining are rapidly changing owing to developments in machine learning-based language models. We review the growing landscape of approaches, mapping them onto three broad paradigms (frequency-based approaches, traditional Natural Language Processing and deep learning-based language models). This serves as an entry point to learn foundational and cutting-edge concepts, vocabularies, and methods to foster integration of these tools into ecological and evolutionary research. We cover approaches for modelling ecological texts, generating training data, developing custom models and interacting with large language models and discuss challenges and possible solutions to implementing these methods in ecology and evolution.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/16ef/11289731/568f1f987102/rspb.2024.0423.f001.jpg

Similar articles

Past and future uses of text mining in ecology and evolution.

Proc Biol Sci. 2022 May 25;289(1975):20212721. doi: 10.1098/rspb.2021.2721. Epub 2022 May 18.

Self-Attention-Based Models for the Extraction of Molecular Interactions from Biological Texts.

Biomolecules. 2021 Oct 27;11(11):1591. doi: 10.3390/biom11111591.

Extractive text summarization system to aid data extraction from full text in systematic review development.

J Biomed Inform. 2016 Dec;64:265-272. doi: 10.1016/j.jbi.2016.10.014. Epub 2016 Oct 27.

Text Mining and Machine Learning Protocol for Extracting Human-Related Protein Phosphorylation Information from PubMed.

Methods Mol Biol. 2022;2496:159-177. doi: 10.1007/978-1-0716-2305-3_9.

Machine Learning and Natural Language Processing in Mental Health: Systematic Review.

J Med Internet Res. 2021 May 4;23(5):e15708. doi: 10.2196/15708.

Is there such a thing as landscape genetics?

Mol Ecol. 2015 Jul;24(14):3518-28. doi: 10.1111/mec.13249. Epub 2015 Jun 19.

An Automated Literature Review Tool (LiteRev) for Streamlining and Accelerating Research Using Natural Language Processing and Machine Learning: Descriptive Performance Evaluation Study.

J Med Internet Res. 2023 Sep 15;25:e39736. doi: 10.2196/39736.

Identifying Patient Populations in Texts Describing Drug Approvals Through Deep Learning-Based Information Extraction: Development of a Natural Language Processing Algorithm.

JMIR Form Res. 2023 Jun 22;7:e44876. doi: 10.2196/44876.

BioBERT: a pre-trained biomedical language representation model for biomedical text mining.

Bioinformatics. 2020 Feb 15;36(4):1234-1240. doi: 10.1093/bioinformatics/btz682.

Cited by

From literature to biodiversity data: mining arthropod organismal traits with machine learning.

Biodivers Data J. 2025 Aug 5;13:e153070. doi: 10.3897/BDJ.13.e153070. eCollection 2025.

New opportunities and challenges for conservation evidence synthesis from advances in natural language processing.

Conserv Biol. 2025 Apr;39(2):e14464. doi: 10.1111/cobi.14464.

Evaluating the feasibility of automating dataset retrieval for biodiversity monitoring.

PeerJ. 2025 Jan 29;13:e18853. doi: 10.7717/peerj.18853. eCollection 2025.

References

Improving large language models for clinical named entity recognition via prompt engineering.

J Am Med Inform Assoc. 2024 Sep 1;31(9):1812-1820. doi: 10.1093/jamia/ocad259.

An extensive benchmark study on biomedical text generation and mining with ChatGPT.

Bioinformatics. 2023 Sep 2;39(9). doi: 10.1093/bioinformatics/btad557.

Reproducibility in ecology and evolution: Minimum standards for data and code.

Ecol Evol. 2023 May 10;13(5):e9961. doi: 10.1002/ece3.9961. eCollection 2023 May.

BiodivNERE: Gold standard corpora for named entity recognition and relation extraction in the biodiversity domain.

Biodivers Data J. 2022 Oct 7;10:e89481. doi: 10.3897/BDJ.10.e89481. eCollection 2022.

Past and future uses of text mining in ecology and evolution.

Proc Biol Sci. 2022 May 25;289(1975):20212721. doi: 10.1098/rspb.2021.2721. Epub 2022 May 18.

A large-scale study on research code quality and execution.

Sci Data. 2022 Feb 21;9(1):60. doi: 10.1038/s41597-022-01143-6.

Text classification to streamline online wildlife trade analyses.

PLoS One. 2021 Jul 9;16(7):e0254007. doi: 10.1371/journal.pone.0254007. eCollection 2021.

Ten simple rules for getting started with command-line bioinformatics.

PLoS Comput Biol. 2021 Feb 18;17(2):e1008645. doi: 10.1371/journal.pcbi.1008645. eCollection 2021 Feb.

Data integration enables global biodiversity synthesis.

Proc Natl Acad Sci U S A. 2021 Feb 9;118(6). doi: 10.1073/pnas.2018093118.

Named Entity Recognition and Relation Detection for Biomedical Information Extraction.

Front Cell Dev Biol. 2020 Aug 28;8:673. doi: 10.3389/fcell.2020.00673. eCollection 2020.
