A novel kernel based approach to arbitrary length symbolic data with application to type 2 diabetes risk.

Affiliations

Department of Computer Science, School of Science and Technology, Middlesex University, London, NW4 4BT, UK.

Centre for Vision Speech and Signal Processing Alan Turing Building (BB), University of Surrey, Guildford, Surrey, GU2 7XH, UK.

Publication information

Sci Rep. 2022 Mar 23;12(1):4985. doi: 10.1038/s41598-022-08757-1.

Abstract

Predictive modeling of clinical data is fraught with challenges arising from the manner in which events are recorded. Patients typically fall ill at irregular intervals and experience dissimilar intervention trajectories. This results in irregularly sampled data of uneven length, which poses a problem for standard multivariate tools. The alternative, feature extraction into equal-length vectors via methods such as Bag-of-Words (BoW), potentially discards useful information. We propose an approach based on a kernel framework in which data is maintained in its native form: discrete sequences of symbols. Kernel functions derived from the edit distance between pairs of sequences may then be used in conjunction with support vector machines (SVMs) to classify the data. Our method is evaluated on the task of predicting which patients are likely to develop type 2 diabetes following an earlier episode of elevated blood pressure of 130/80 mmHg. Kernels combined via multiple kernel learning (MKL) achieved an F1-score of 0.96, outperforming SVM (0.63), logistic regression (0.63), Long Short-Term Memory (0.61) and Multi-Layer Perceptron (0.54) classifiers applied to a BoW representation of the data. On an external dataset, MKL achieved an F1-score of 0.97. The proposed approach is consequently able to overcome limitations associated with feature-based classification in the context of clinical data.
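To make the contrast between a BoW representation and an edit-distance kernel concrete, the following is a minimal sketch, not the authors' implementation. It assumes unit-cost Levenshtein distance and an exponential kernel of the form exp(-gamma * d); the abstract does not state the exact kernel form, so `edit_kernel`, `gram_matrix`, `gamma` and the event codes are all hypothetical names chosen for illustration.

```python
from collections import Counter
import math


def edit_distance(s, t):
    """Levenshtein distance between two symbol sequences (unit costs)."""
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, start=1):
        curr = [i]
        for j, b in enumerate(t, start=1):
            curr.append(min(
                prev[j] + 1,              # delete a from s
                curr[j - 1] + 1,          # insert b into s
                prev[j - 1] + (a != b),   # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def edit_kernel(s, t, gamma=0.5):
    """Assumed kernel form: similarity decays exponentially with distance."""
    return math.exp(-gamma * edit_distance(s, t))


def gram_matrix(seqs, gamma=0.5):
    """Pairwise kernel values for sequences of arbitrary, unequal length."""
    return [[edit_kernel(s, t, gamma) for t in seqs] for s in seqs]


# Hypothetical event codes for three patients (illustrative only).
a = ["BP", "GP", "LAB", "RX"]
b = ["RX", "LAB", "GP", "BP"]   # same events as a, in reverse order
c = ["BP", "GP"]                # shorter record

same_bow = Counter(a) == Counter(b)   # BoW cannot tell a and b apart
d_ab = edit_distance(a, b)            # the edit distance can
K = gram_matrix([a, b, c])
```

A matrix such as `K` could be passed to an SVM that accepts a precomputed kernel, for example scikit-learn's `SVC(kernel="precomputed")`. Kernels built from edit distance are not guaranteed to be positive semi-definite in general, so a practical implementation may need to check or correct for this.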

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/73e0/8943170/3abd80d85cb9/41598_2022_8757_Fig1_HTML.jpg
