Hojda Maciej
Faculty of Information and Communication Technology, Wroclaw University of Science and Technology, Wyb. Wyspiańskiego 27, 50-370 Wrocław, Poland.
Sensors (Basel). 2025 Jul 13;25(14):4380. doi: 10.3390/s25144380.
Sensor data are widely available but stored in many different formats, which makes them difficult to reuse in other applications. We consider the problem of extracting sensor data from unstructured and semi-structured texts using Large Language Models. Through careful prompt design, we obtain output in a strict JSON structure that can be processed further by automated tools. We establish a workflow that enables the extraction of data using GPT-4, Llama 3, Mistral and Falcon models, and we show that while the closed-source GPT-4 model generally leads in conversion efficiency, the open-source models can approach it if given appropriate data structures. We define new measures to simplify the comparison, and we present a multi-purpose workflow for sensor data extraction. We observe that some of the smaller models cannot correctly extract data from freeform text but handle tabular data well. Larger models, on the other hand, are more robust and make fewer conversion mistakes.
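The strict JSON structure mentioned in the abstract is what makes automated downstream processing possible, so a reply that deviates from it needs to be detected and rejected. The following is a minimal sketch of such a check, not the workflow from the paper: the field names (`sensor`, `quantity`, `value`, `unit`), the `REQUIRED_FIELDS` table, and the function `validate_readings` are all hypothetical and chosen only for illustration. It uses only the Python standard library and assumes the model's reply is available as a string.

```python
import json

# Hypothetical schema: these field names and types are illustrative
# assumptions, not the structure used in the paper.
REQUIRED_FIELDS = {
    "sensor": str,
    "quantity": str,
    "value": (int, float),
    "unit": str,
}


def validate_readings(reply: str) -> list:
    """Parse a model reply and accept it only if it is a strict JSON array
    of readings whose keys and value types match REQUIRED_FIELDS."""
    # json.loads raises json.JSONDecodeError (a subclass of ValueError)
    # when the reply contains anything other than valid JSON.
    data = json.loads(reply)
    if not isinstance(data, list):
        raise ValueError("expected a JSON array of readings")
    for index, item in enumerate(data):
        if not isinstance(item, dict):
            raise ValueError(f"reading {index} is not a JSON object")
        # Strict structure: no missing keys and no extra keys.
        if set(item) != set(REQUIRED_FIELDS):
            raise ValueError(f"reading {index} has unexpected keys: {sorted(item)}")
        for name, expected in REQUIRED_FIELDS.items():
            value = item[name]
            # bool is a subclass of int in Python, so reject it explicitly.
            if isinstance(value, bool) or not isinstance(value, expected):
                raise ValueError(f"reading {index}: field '{name}' has the wrong type")
    return data


if __name__ == "__main__":
    reply = '[{"sensor": "t1", "quantity": "temperature", "value": 21.5, "unit": "C"}]'
    print(validate_readings(reply))
```

A reply wrapped in conversational text, or one with a missing or extra field, raises `ValueError` here, so a caller could count such failures per model or retry the prompt. How failures are actually scored in the paper's measures is not specified in this abstract.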