
The first step is the hardest: pitfalls of representing and tokenizing temporal data for large language models.

Affiliations

Nokia Bell Labs, Cambridge, CB3 0FA, United Kingdom.

Department of Computer Science and Technology, University of Cambridge, Cambridge, CB3 0FD, United Kingdom.

Publication Information

J Am Med Inform Assoc. 2024 Sep 1;31(9):2151-2158. doi: 10.1093/jamia/ocae090.


DOI: 10.1093/jamia/ocae090
PMID: 38950417
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11339515/
Abstract

OBJECTIVES: Large language models (LLMs) have demonstrated remarkable generalization across diverse tasks, leading individuals to increasingly use them as personal assistants due to their emerging reasoning capabilities. Nevertheless, a notable obstacle emerges when including numerical/temporal data in these prompts, such as data sourced from wearables or electronic health records. LLMs employ tokenizers that break input text down into smaller units. However, tokenizers are not designed to represent numerical values and might struggle to understand repetitive patterns and context, treating consecutive values as separate tokens and disregarding their temporal relationships. This article discusses the challenges of representing and tokenizing temporal data. It argues that naively passing timeseries to LLMs can be ineffective owing to the modality gap between numbers and text.

MATERIALS AND METHODS: We conduct a case study by tokenizing a sample mobile sensing dataset with the OpenAI tokenizer. We also review recent works that feed timeseries data into LLMs for human-centric tasks, outlining common experimental setups such as zero-shot prompting and few-shot learning.

RESULTS: The case study shows that popular LLMs split timestamps and sensor values into multiple non-meaningful tokens, indicating they struggle with temporal data. We find that preliminary works rely heavily on prompt engineering and timeseries aggregation to "ground" LLMs, hinting that the "modality gap" hampers progress. The literature was critically analyzed through the lens of models optimizing for expressiveness versus parameter efficiency: at one end of the spectrum, training large domain-specific models from scratch is expressive but not parameter-efficient; at the other, zero-shot prompting of LLMs is parameter-efficient but lacks expressiveness for temporal data.

DISCUSSION: We argue that tokenizers are not optimized for numerical data, while the scarcity of timeseries examples in training corpora exacerbates these difficulties. We advocate balancing model expressiveness and computational efficiency when integrating temporal data. Prompt tuning, model grafting, and improved tokenizers are highlighted as promising directions.

CONCLUSION: We underscore that despite their promising capabilities, LLMs cannot meaningfully process temporal data unless the input representation is addressed. We argue that this paradigm shift in how we leverage pretrained models will particularly affect the area of biomedical signals, given the lack of modality-specific foundation models.
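The case study's central observation, that timestamps and sensor values shatter into non-meaningful tokens, can be sketched with a toy approximation of BPE digit handling. The regex below mimics, but does not reproduce, the digit-chunking behavior of tokenizers such as OpenAI's cl100k_base (digit runs broken into pieces of at most three digits); the sample reading string is hypothetical:

```python
import re

def naive_digit_chunks(text: str) -> list[str]:
    # Rough approximation of BPE digit handling: digit runs are split
    # into chunks of at most three digits, everything else is kept as
    # contiguous non-digit spans. This is an illustration, not the real
    # merge table of any production tokenizer.
    return re.findall(r"\d{1,3}|\D+", text)

# One timestamped heart-rate sample (hypothetical data):
reading = "2023-09-07T14:00:00,72.5"
print(naive_digit_chunks(reading))
```

A single serialized reading fragments into more than a dozen pieces, and neither the year nor any clock field survives as a unit; this is the "modality gap" the article argues makes naive timeseries prompting ineffective.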


Similar Articles

[1]
The first step is the hardest: pitfalls of representing and tokenizing temporal data for large language models.

J Am Med Inform Assoc. 2024-9-1

[2]
Signs and symptoms to determine if a patient presenting in primary care or hospital outpatient settings has COVID-19.

Cochrane Database Syst Rev. 2022-5-20

[3]
Adapting Safety Plans for Autistic Adults with Involvement from the Autism Community.

Autism Adulthood. 2025-5-28

[4]
Home treatment for mental health problems: a systematic review.

Health Technol Assess. 2001

[5]
How lived experiences of illness trajectories, burdens of treatment, and social inequalities shape service user and caregiver participation in health and social care: a theory-informed qualitative evidence synthesis.

Health Soc Care Deliv Res. 2025-6

[6]
Stigma Management Strategies of Autistic Social Media Users.

Autism Adulthood. 2025-5-28

[7]
"Just Ask What Support We Need": Autistic Adults' Feedback on Social Skills Training.

Autism Adulthood. 2025-5-28

[8]
Electric fans for reducing adverse health impacts in heatwaves.

Cochrane Database Syst Rev. 2012-7-11

[9]
Health professionals' experience of teamwork education in acute hospital settings: a systematic review of qualitative literature.

JBI Database System Rev Implement Rep. 2016-4

[10]
Evaluating and Improving Syndrome Differentiation Thinking Ability in Large Language Models: Method Development Study.

JMIR Med Inform. 2025-6-20

Cited By

[1]
Large language models in biomedicine and health: current research landscape and future directions.

J Am Med Inform Assoc. 2024-9-1

