Department of Biostatistics, School of Public Health, Yale University, New Haven, USA.
The Jackson School of Global Affairs, Yale University, New Haven, USA.
BMC Bioinformatics. 2023 Dec 17;24(1):482. doi: 10.1186/s12859-023-05597-2.
This paper presents novel datasets that provide numerical representations of ICD-10-CM codes. The representations are generated by embedding each code's description with a large language model and then reducing the dimension of the embeddings with an autoencoder. The embeddings serve as informative input features for machine learning models because they capture relationships among categories and preserve inherent context information. The model generating the data was validated in two ways: first, the dimension reduction was validated using an autoencoder; second, a supervised model was built to estimate the ICD-10-CM hierarchical categories. Results show that the data can be reduced to as few as 10 dimensions while maintaining the ability to reproduce the original embeddings, with fidelity decreasing as the dimension of the reduced representation decreases. Multiple compression levels are provided, so users can choose the one that fits their requirements and then download and use the data without any further setup. The readily available datasets of ICD-10-CM codes are anticipated to be highly valuable for researchers in biomedical informatics, enabling more advanced analyses in the field. This approach has the potential to significantly improve the utility of ICD-10-CM codes in the biomedical domain.
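The trade-off between compression level and reconstruction fidelity described in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the random low-rank vectors stand in for the language-model description embeddings, the linear autoencoder trained by gradient descent stands in for the paper's autoencoder (whose architecture is not given here), and the function name `train_linear_autoencoder` and the bottleneck sizes 2, 4 and 8 are hypothetical.

```python
import numpy as np


def train_linear_autoencoder(X, k, steps=3000, lr=0.05, seed=0):
    """Fit a linear autoencoder with a k-dimensional bottleneck.

    Hypothetical helper for illustration only; not the paper's model.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W_enc = rng.normal(scale=0.1, size=(d, k))  # encoder: d -> k
    W_dec = rng.normal(scale=0.1, size=(k, d))  # decoder: k -> d
    for _ in range(steps):
        H = X @ W_enc                 # compressed representation
        R = H @ W_dec - X             # reconstruction residual
        # Gradients of the mean (over samples) squared reconstruction error.
        grad_dec = (2.0 / n) * H.T @ R
        grad_enc = (2.0 / n) * X.T @ (R @ W_dec.T)
        W_enc -= lr * grad_enc
        W_dec -= lr * grad_dec
    return W_enc, W_dec


# Synthetic stand-in for description embeddings (assumption): 300 "codes",
# 32-dimensional vectors with 8 underlying factors plus a little noise.
rng = np.random.default_rng(42)
latent = rng.normal(size=(300, 8))
mixing = rng.normal(size=(8, 32)) / np.sqrt(32)
X = latent @ mixing + 0.01 * rng.normal(size=(300, 32))
X = X - X.mean(axis=0)  # center each feature

# Train one autoencoder per compression level and record the per-element
# mean squared reconstruction error.
errors = {}
for k in (2, 4, 8):
    W_enc, W_dec = train_linear_autoencoder(X, k)
    errors[k] = float(np.mean((X @ W_enc @ W_dec - X) ** 2))

for k, err in errors.items():
    print(f"bottleneck {k}: reconstruction MSE = {err:.5f}")
```

In this toy setting the reconstruction error grows as the bottleneck shrinks, which mirrors the qualitative behaviour reported in the abstract. A user of the published datasets would pick a compression level by weighing that loss of fidelity against the convenience of a smaller feature vector.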