Ha Eunyong, Ha Seung Min, Gerelkhuu Zayakhuu, Kim Hyun-Yi, Yoon Tae Hyun
Department of Chemistry, Hanyang University, Seoul 04763, Republic of Korea.
Research Institute for Convergence of Basic Science, Hanyang University, Seoul 04763, Republic of Korea.
Comput Struct Biotechnol J. 2025 Apr 3;29:138-148. doi: 10.1016/j.csbj.2025.03.052. eCollection 2025.
With the growing use of nanomaterials (NMs), assessing their toxicity has become increasingly important. Computational models for predicting nanotoxicity are emerging as alternatives to traditional in vitro and in vivo assays, which are costly and raise ethical concerns. As a result, both the quality and the quantity of available data are now widely recognized as important. However, collecting large, high-quality datasets is time-consuming and labor-intensive. Artificial intelligence (AI)-based techniques hold significant potential for extracting and organizing information from unstructured text, yet the use of large language models (LLMs) and prompt engineering for nanotoxicity data extraction has not been widely studied. In this study, we developed an AI-based automated data extraction pipeline, implemented with the Python-based LangChain framework, to facilitate efficient data collection. We used 216 nanotoxicity research articles as training data to refine prompts and evaluate LLM performance. The most suitable LLM, with the refined prompts, was then used to extract test data from 605 research articles. On the training data, the F1 score for data extraction ranged from 84.6% to 87.6% across the different LLMs. Using the dataset extracted from the test set, we constructed automated machine learning (AutoML) models that achieved F1 scores for nanotoxicity prediction exceeding 86.1%. We also assessed the reliability and applicability of the models by comparing them in terms of ground truth, size, and balance. This study highlights the potential of AI-based data extraction and represents a significant contribution to nanotoxicity research.
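The abstract reports extraction quality as an F1 score but does not describe how it was computed. As a minimal sketch only, the following Python shows one common way such a score could be obtained: treating each extracted value as an (article, field, value) triple and comparing the set of extracted triples against a manually curated ground truth. The triple representation, the function name `extraction_f1`, and the sample records are all hypothetical and are not taken from the study.

```python
def extraction_f1(extracted, ground_truth):
    """Micro-averaged F1 over (article_id, field, value) triples.

    Assumption: a triple counts as correct only on an exact match.
    The study's actual matching rules are not stated in the abstract.
    """
    extracted = set(extracted)
    ground_truth = set(ground_truth)

    true_pos = len(extracted & ground_truth)   # correctly extracted triples
    false_pos = len(extracted - ground_truth)  # extracted but not in the truth
    false_neg = len(ground_truth - extracted)  # in the truth but missed

    if true_pos == 0:
        return 0.0

    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)


# Hypothetical records for illustration; not data from the study.
truth = [
    ("paper_1", "material", "TiO2"),
    ("paper_1", "cell_line", "A549"),
    ("paper_2", "material", "ZnO"),
    ("paper_2", "exposure_time_h", "24"),
]
predicted = [
    ("paper_1", "material", "TiO2"),
    ("paper_1", "cell_line", "A549"),
    ("paper_2", "material", "ZnO"),
    ("paper_2", "exposure_time_h", "48"),  # wrong value: one FP and one FN
]

score = extraction_f1(predicted, truth)
print(f"F1 = {score:.3f}")
```

Exact matching is the strictest choice. A real evaluation might normalize units or accept synonyms before comparing, which would raise the score without changing the formula.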