
The first step is the hardest: pitfalls of representing and tokenizing temporal data for large language models.

Affiliations

Nokia Bell Labs, Cambridge, CB3 0FA, United Kingdom.

Department of Computer Science and Technology, University of Cambridge, Cambridge, CB3 0FD, United Kingdom.

Publication Information

J Am Med Inform Assoc. 2024 Sep 1;31(9):2151-2158. doi: 10.1093/jamia/ocae090.


DOI: 10.1093/jamia/ocae090
PMID: 38950417
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11339515/
Abstract

OBJECTIVES: Large language models (LLMs) have demonstrated remarkable generalization across diverse tasks, leading individuals to increasingly use them as personal assistants due to their emerging reasoning capabilities. Nevertheless, a notable obstacle emerges when including numerical/temporal data in these prompts, such as data sourced from wearables or electronic health records. LLMs employ tokenizers that break input text down into smaller units. However, tokenizers are not designed to represent numerical values and might struggle to understand repetitive patterns and context, treating consecutive values as separate tokens and disregarding their temporal relationships. This article discusses the challenges of representing and tokenizing temporal data. It argues that naively passing timeseries to LLMs can be ineffective owing to the modality gap between numbers and text.

MATERIALS AND METHODS: We conduct a case study by tokenizing a sample mobile sensing dataset with the OpenAI tokenizer. We also review recent works that feed timeseries data into LLMs for human-centric tasks, outlining common experimental setups such as zero-shot prompting and few-shot learning.

RESULTS: The case study shows that popular LLMs split timestamps and sensor values into multiple non-meaningful tokens, indicating they struggle with temporal data. We find that preliminary works rely heavily on prompt engineering and timeseries aggregation to "ground" LLMs, hinting that the "modality gap" hampers progress. The literature was critically analyzed through the lens of models optimizing for expressiveness versus parameter efficiency: at one end of the spectrum, training large domain-specific models from scratch is expressive but not parameter-efficient; at the other, zero-shot prompting of LLMs is parameter-efficient but lacks expressiveness for temporal data.

DISCUSSION: We argue that tokenizers are not optimized for numerical data, while the scarcity of timeseries examples in training corpora exacerbates these difficulties. We advocate balancing model expressiveness and computational efficiency when integrating temporal data. Prompt tuning, model grafting, and improved tokenizers are highlighted as promising directions.

CONCLUSION: We underscore that despite their promising capabilities, LLMs cannot meaningfully process temporal data unless the input representation is addressed. We argue that this paradigm shift in how we leverage pretrained models will particularly affect the area of biomedical signals, given the lack of modality-specific foundation models.
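The case study's central observation, that timestamps and sensor values shatter into non-meaningful tokens, can be sketched with a toy approximation of BPE digit handling. The regex below mimics, but does not reproduce, the digit-chunking behavior of tokenizers such as OpenAI's cl100k_base (digit runs broken into pieces of at most three digits); the sample reading string is hypothetical:

```python
import re

def naive_digit_chunks(text: str) -> list[str]:
    # Rough approximation of BPE digit handling: digit runs are split
    # into chunks of at most three digits, everything else is kept as
    # contiguous non-digit spans. This is an illustration, not the real
    # merge table of any production tokenizer.
    return re.findall(r"\d{1,3}|\D+", text)

# One timestamped heart-rate sample (hypothetical data):
reading = "2023-09-07T14:00:00,72.5"
print(naive_digit_chunks(reading))
```

A single serialized reading fragments into more than a dozen pieces, and neither the year nor any clock field survives as a unit; this is the "modality gap" the article argues makes naive timeseries prompting ineffective.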


Similar Articles

[1]
The first step is the hardest: pitfalls of representing and tokenizing temporal data for large language models.

J Am Med Inform Assoc. 2024-9-1

[2]
Signs and symptoms to determine if a patient presenting in primary care or hospital outpatient settings has COVID-19.

Cochrane Database Syst Rev. 2022-5-20

[3]
Adapting Safety Plans for Autistic Adults with Involvement from the Autism Community.

Autism Adulthood. 2025-5-28

[4]
Home treatment for mental health problems: a systematic review.

Health Technol Assess. 2001

[5]
How lived experiences of illness trajectories, burdens of treatment, and social inequalities shape service user and caregiver participation in health and social care: a theory-informed qualitative evidence synthesis.

Health Soc Care Deliv Res. 2025-6

[6]
Stigma Management Strategies of Autistic Social Media Users.

Autism Adulthood. 2025-5-28

[7]
"Just Ask What Support We Need": Autistic Adults' Feedback on Social Skills Training.

Autism Adulthood. 2025-5-28

[8]
Electric fans for reducing adverse health impacts in heatwaves.

Cochrane Database Syst Rev. 2012-7-11

[9]
Health professionals' experience of teamwork education in acute hospital settings: a systematic review of qualitative literature.

JBI Database System Rev Implement Rep. 2016-4

[10]
Evaluating and Improving Syndrome Differentiation Thinking Ability in Large Language Models: Method Development Study.

JMIR Med Inform. 2025-6-20

Cited By

[1]
Large language models in biomedicine and health: current research landscape and future directions.

J Am Med Inform Assoc. 2024-9-1

