Department of Biochemistry, All India Institute of Medical Sciences Bhubaneswar, Bhubaneswar, India.
Department of Radiodiagnosis, All India Institute of Medical Sciences Bhubaneswar, Bhubaneswar, India.
Ann Clin Biochem. 2022 Jan;59(1):76-86. doi: 10.1177/00045632211046805. Epub 2021 Oct 6.
LDL-C is a strong risk factor for cardiovascular disorders. The formulas used to calculate LDL-C show varying performance in different populations. Machine learning models can capture complex interactions between variables and can be used to predict outcomes more accurately. The current study evaluated the predictive performance of three machine learning models (random forests, XGBoost, and support vector regression (SVR)) for predicting LDL-C from total cholesterol, triglyceride, and HDL-C, in comparison with a linear regression model and some existing formulas for LDL-C calculation, in an eastern Indian population.
A total of 13,391 lipid profiles performed in the clinical biochemistry laboratory of AIIMS Bhubaneswar during 2019-2021 were included in the study. Laboratory results were collected from the laboratory database. Seventy percent of the data were assigned to a training set and used to develop the three machine learning models and a linear regression formula. These models were then validated on the remaining 30% of the data (test set). Model performance was evaluated in comparison with the six best existing LDL-C calculation formulas.
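The 70/30 partition described above can be illustrated with a minimal sketch. The abstract does not say how the split was performed, so the random shuffle, the seed, the function name `train_test_split_rows`, and the sample records below are all assumptions made for illustration only.

```python
import random


def train_test_split_rows(rows, train_fraction=0.7, seed=0):
    """Shuffle rows reproducibly and return (train, test) lists.

    Hypothetical helper: the study's actual splitting procedure is not
    described in the abstract.
    """
    shuffled = list(rows)                      # copy so the input is not mutated
    random.Random(seed).shuffle(shuffled)      # seeded shuffle for reproducibility
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Made-up records, not study data:
# (total cholesterol, triglycerides, HDL-C, directly measured LDL-C), all in mg/dL
records = [(180 + i, 120 + 2 * i, 45, 110 + i) for i in range(10)]
train_rows, test_rows = train_test_split_rows(records)
```

The models would then be fitted on `train_rows` only and scored on `test_rows`, so that reported performance reflects data the models have not seen.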
LDL-C predicted by the XGBoost and random forests models showed a strong correlation with directly estimated LDL-C (r = 0.98). These two machine learning models outperformed the six existing, commonly used LDL-C calculation formulas, such as Friedewald, in the study population. When compared across triglyceride strata, these two models also outperformed the other methods.
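For reference, the Friedewald formula named above estimates LDL-C (in mg/dL) as total cholesterol minus HDL-C minus triglycerides divided by 5, and it is generally considered unreliable at triglyceride levels of 400 mg/dL or more. The sketch below shows that formula together with a plain Pearson correlation of the kind reported as r. The function names and the input values are hypothetical and are not taken from the study.

```python
import math


def friedewald_ldl(total_chol, triglycerides, hdl):
    """Friedewald estimate of LDL-C in mg/dL: TC - HDL-C - TG / 5."""
    return total_chol - hdl - triglycerides / 5.0


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up example values in mg/dL, not study data:
# (total cholesterol, triglycerides, HDL-C, directly measured LDL-C)
samples = [(200, 150, 50, 118), (180, 100, 45, 117), (240, 200, 40, 155)]
estimated = [friedewald_ldl(tc, tg, hdl) for tc, tg, hdl, _ in samples]
measured = [direct for _, _, _, direct in samples]
r = pearson_r(estimated, measured)
```

The same `pearson_r` comparison could be applied to the predictions of any model against directly measured LDL-C, which is how the formulas and the machine learning models can be placed on a common scale.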
Machine learning models such as XGBoost and random forests can predict LDL-C more accurately than conventional linear regression LDL-C formulas.