Li Yupeng, Dong Wei, Ru Boshu, Black Adam, Zhang Xinyuan, Guan Yuanfang
Merck & Co., Inc., Rahway, NJ, USA.
Ann Arbor Algorithms Inc., Ann Arbor, MI 48104, USA.
iScience. 2022 Aug 4;25(9):104880. doi: 10.1016/j.isci.2022.104880. eCollection 2022 Sep 16.
Many fields, including Natural Language Processing (NLP), have recently witnessed the benefit of pre-training with large generic datasets to improve the accuracy of prediction tasks. However, there exist key differences between longitudinal healthcare data (e.g., claims) and NLP tasks, which make the direct application of NLP pre-training methods to healthcare data inappropriate. In this article, we developed a pre-training scheme for longitudinal healthcare data that leverages the pairing of a medical history and a future event. We then conducted systematic evaluations of various methods on ten patient-level prediction tasks encompassing adverse events, misdiagnosis, disease risks, and readmission. Our results show that a universal medical concept embedding pre-trained with generic big data, together with carefully designed time decay modeling, improves the accuracy of different downstream prediction tasks while substantially reducing model size.
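The abstract does not specify the exact form of the time decay modeling, but the general idea of down-weighting older events in a patient's longitudinal record can be sketched as follows. This is a minimal illustrative example, assuming an exponential decay with a hypothetical `half_life` hyperparameter, not the authors' actual formulation:

```python
import numpy as np

def time_decay_pool(embeddings, days_before_event, half_life=90.0):
    """Aggregate per-event medical concept embeddings into one patient
    vector, weighting recent events more heavily via exponential decay.
    `half_life` (in days) is an illustrative hyperparameter, not a value
    from the paper."""
    embeddings = np.asarray(embeddings, dtype=float)  # shape (n_events, dim)
    t = np.asarray(days_before_event, dtype=float)    # shape (n_events,)
    weights = 0.5 ** (t / half_life)                  # older events decay toward 0
    weights /= weights.sum()                          # normalize to a convex combination
    return weights @ embeddings                       # shape (dim,)

# Toy example: three one-hot "concept embeddings" observed 10, 100,
# and 400 days before the target event.
emb = np.eye(3)
patient_vec = time_decay_pool(emb, [10, 100, 400])
```

With the identity embeddings above, the pooled vector equals the normalized decay weights, so the most recent event dominates the representation.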