Leveraging large language models for data analysis automation.

Authors

Jansen Jacqueline A, Manukyan Artür, Al Khoury Nour, Akalin Altuna

Affiliations

Max Delbrück Center for Molecular Medicine in the Helmholtz Association (MDC), Berlin, Germany.

Bioinformatics and Omics Data Science Platform, Berlin, Germany.

Publication information

PLoS One. 2025 Feb 21;20(2):e0317084. doi: 10.1371/journal.pone.0317084. eCollection 2025.

DOI:10.1371/journal.pone.0317084

PMID:39982913

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11844886/

Abstract

Data analysis is constrained by a shortage of skilled experts, particularly in biology, where detailed data analysis and subsequent interpretation are vital for understanding complex biological processes and developing new treatments and diagnostics. One possible solution to this shortage of experts is to use Large Language Models (LLMs) to generate data analysis pipelines. However, although LLMs have shown great potential in code generation tasks, their accuracy when prompted with domain-expert questions, such as omics-related data analysis questions, remains unknown. To address this, we developed mergen, an R package that leverages LLMs for data analysis code generation and execution. We evaluated the performance of this data analysis system on a range of data analysis tasks in genomics. Our primary goal is to enable researchers to conduct data analysis by simply describing, in clear text, their objectives and the analyses they want for specific datasets. Our approach improves code generation via specialized prompt engineering and error feedback mechanisms. In addition, our system can execute the data analysis workflows prescribed by the LLM and return the results for human review. Our evaluation reveals that while LLMs effectively generate code for some data analysis tasks, challenges remain in generating executable code, especially for complex tasks. The best performance was seen with the self-correction mechanism: compared with the simple strategy, self-correct increased the percentage of executable code by 22.5% for tasks of complexity 2, and by 52.5%, 27.5% and 15% for tasks of complexity 3, 4 and 5, respectively. A chi-squared test showed significant differences between the prompting strategies. Our study contributes to a better understanding of LLM capabilities and limitations, providing software infrastructure and practical insights for their effective integration into data analysis workflows.
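
The error feedback idea mentioned in the abstract can be illustrated with a minimal sketch. This is not mergen's API (mergen is an R package and its function names are not given here); it is a Python illustration under stated assumptions: `self_correct`, `run_code`, and `fake_llm` are hypothetical names, and `fake_llm` is a stand-in for a real model call. The loop generates code, tries to execute it, and on failure sends the error message back in the next prompt.

```python
import traceback
from typing import Callable, Optional, Tuple


def run_code(code: str) -> Optional[str]:
    """Execute code in a fresh namespace; return None on success, else the error text."""
    try:
        exec(code, {})
        return None
    except Exception:
        return traceback.format_exc(limit=1)


def self_correct(
    task: str, generate: Callable[[str], str], max_attempts: int = 3
) -> Tuple[str, bool, int]:
    """Hypothetical self-correction loop: regenerate code until it runs or attempts run out.

    Returns (last generated code, whether it executed, number of attempts used).
    """
    prompt = task
    code = ""
    for attempt in range(1, max_attempts + 1):
        code = generate(prompt)
        error = run_code(code)
        if error is None:
            return code, True, attempt
        # Feed the failing code and its error message back to the generator.
        prompt = (
            f"{task}\n\nThe previous code:\n{code}\n\n"
            f"failed with:\n{error}\nReturn corrected code."
        )
    return code, False, max_attempts


def fake_llm(prompt: str) -> str:
    """Stand-in for a model call (assumption): buggy first answer, fixed after error feedback."""
    if "failed with" in prompt:
        return "values = [1, 2, 3]\nmean = sum(values) / len(values)"
    return "values = [1, 2, 3]\nmean = sum(values) / count"  # raises NameError


code, ok, attempts = self_correct("Compute the mean of 1, 2, 3.", fake_llm)
print(ok, attempts)  # True 2
```

Executability is the only success criterion in this sketch, which mirrors the metric reported in the abstract (percentage of executable code); it says nothing about whether the executed analysis is correct, which is why the results still go to human review.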
