Amouei Sheshkal Sajad, Gundersen Morten, Alexander Riegler Michael, Aass Utheim Øygunn, Gunnar Gundersen Kjell, Rootwelt Helge, Prestø Elgstøen Katja Benedikte, Lewi Hammer Hugo
Department of Computer Science, Oslo Metropolitan University, 0166 Oslo, Norway.
Department of Holistic Systems, SimulaMet, 0167 Oslo, Norway.
Diagnostics (Basel). 2024 Nov 29;14(23):2696. doi: 10.3390/diagnostics14232696.
Dry eye disease is a common disorder of the ocular surface that leads patients to seek eye care. It is currently diagnosed from clinical signs and symptoms. Metabolomics, a method for analyzing biological systems, has been found helpful in identifying distinct metabolites in patients and in detecting metabolic profiles that may indicate dry eye disease at early stages. In this study, we explored the use of machine learning and metabolomics data to identify cataract patients who suffer from dry eye disease, a topic that, to our knowledge, has not been previously explored. As there is no one-size-fits-all machine learning model for metabolomics data, the choice of model can significantly affect the quality of predictions and subsequent metabolomics analyses. To address this challenge, we conducted a comparative analysis of eight machine learning models on two metabolomics data sets from cataract patients with and without dry eye disease. The models were evaluated and optimized using nested k-fold cross-validation. To assess their performance, we selected a set of evaluation metrics tailored to the data set's challenges. The logistic regression model performed best overall, achieving the highest area under the curve (AUC) score of 0.8378, a balanced accuracy of 0.735, a Matthews correlation coefficient of 0.5147, an F1-score of 0.8513, and a specificity of 0.5667. The XGBoost and Random Forest models followed and also demonstrated good performance. The results show that the logistic regression model with L2 regularization can outperform more complex models on an imbalanced data set with a small sample size and a high number of features, while also avoiding overfitting and delivering consistent performance across cross-validation folds. Additionally, the results demonstrate that it is possible to identify dry eye in cataract patients from tear film metabolomics data using machine learning models.
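The abstract reports balanced accuracy, Matthews correlation coefficient (MCC), F1-score, and specificity because plain accuracy can mislead on an imbalanced data set. The following is a minimal sketch of how these metrics follow from binary confusion-matrix counts, using their standard definitions. The function name `imbalance_metrics` and the example counts are hypothetical and chosen only for illustration; they are not taken from the study's data or code.

```python
import math


def imbalance_metrics(tp, fp, tn, fn):
    """Compute imbalance-aware metrics from binary confusion-matrix counts.

    tp/fn count positive cases (e.g. dry eye) predicted correctly/incorrectly;
    tn/fp count negative cases predicted correctly/incorrectly.
    Undefined ratios (zero denominators) are reported as 0.0 by convention.
    """
    total = tp + fp + tn + fn
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    f1_denominator = 2 * tp + fp + fn
    mcc_denominator = math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "sensitivity": sensitivity,
        "specificity": specificity,
        # Mean of per-class recall, so both classes weigh equally.
        "balanced_accuracy": (sensitivity + specificity) / 2,
        "f1": 2 * tp / f1_denominator if f1_denominator else 0.0,
        # MCC uses all four cells; 0 means no better than chance.
        "mcc": (tp * tn - fp * fn) / mcc_denominator
        if mcc_denominator else 0.0,
    }


if __name__ == "__main__":
    # Hypothetical counts: 50 positives, 10 negatives (not study data).
    informative = imbalance_metrics(tp=45, fp=4, tn=6, fn=5)
    # A degenerate classifier that labels every sample positive.
    always_positive = imbalance_metrics(tp=50, fp=10, tn=0, fn=0)
    for name, result in [("informative", informative),
                         ("always positive", always_positive)]:
        print(name, {k: round(v, 4) for k, v in result.items()})
```

In the hypothetical "always positive" case, accuracy and F1 remain high (about 0.83 and 0.91) even though the classifier never identifies a negative case, while balanced accuracy drops to 0.5 and MCC to 0. That contrast is the usual motivation for reporting metrics like these on imbalanced data.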