Oyama Katsunori, Isogai Toshiki, Nakayama Yohei, Kobayashi Ryoki, Kitano Daisuke, Karako Kenji, Sakatani Kaoru
Department of Computer Science, College of Engineering, Nihon University, Koriyama, Japan.
Graduate School of Computer Science, Nihon University, Koriyama, Japan.
Front Neurol. 2024 Aug 14;15:1379916. doi: 10.3389/fneur.2024.1379916. eCollection 2024.
This study aimed to investigate whether data augmentation can improve dementia risk prediction with machine learning models. Recent studies have shown that basic blood tests are a cost-effective means of predicting cognitive function. However, developing models that work across varied conditions is difficult because blood test results and cognitive assessments come with constraints: high costs, limited sample sizes, and missing data from tests that certain facilities do not perform. Periodontal examination data have also emerged as a cost-effective screening tool, although they too are often limited by small sample sizes.
To address these challenges, this study explored the effectiveness of data augmentation using the Synthetic Minority Over-sampling Technique for Regression with Gaussian noise (SMOGN), a Generative Adversarial Network (GAN), and a Conditional Tabular GAN (CTGAN) on periodontal examination and blood test data. The datasets included parameters such as cognitive assessment results from the Mini-Mental State Examination (MMSE), demographic characteristics, periodontal examination data, and blood test results. Linear regression models, random forests, and deep neural networks were used to evaluate the effectiveness of the synthesized data.
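The interpolation idea behind SMOGN-style augmentation can be illustrated with a minimal sketch. This is not the SMOGN algorithm itself, which oversamples only rare target regions and chooses between interpolation and noise according to neighbour distance; it is a simplified stand-in that interpolates between a row and one of its nearest neighbours and then adds Gaussian noise. The function name, the parameters (`k`, `noise_sd`), and the toy rows are all hypothetical and are not taken from the study.

```python
import random


def augment_with_interpolation(rows, n_new, noise_sd=0.05, k=3, seed=0):
    """Return n_new synthetic rows (simplified SMOTE-like sketch, not full SMOGN).

    Each synthetic row is a random point on the segment between a randomly
    chosen row and one of its k nearest neighbours, plus Gaussian noise scaled
    by each column's standard deviation. The target is treated as a column.
    """
    rng = random.Random(seed)
    n_cols = len(rows[0])
    means = [sum(r[j] for r in rows) / len(rows) for j in range(n_cols)]
    sds = [
        (sum((r[j] - means[j]) ** 2 for r in rows) / len(rows)) ** 0.5
        for j in range(n_cols)
    ]

    def dist(a, b):
        # Euclidean distance on standardized columns (constant columns skipped).
        return sum(((x - y) / s) ** 2 for x, y, s in zip(a, b, sds) if s > 0) ** 0.5

    synthetic = []
    for _ in range(n_new):
        base = rng.choice(rows)
        neighbours = sorted(
            (r for r in rows if r is not base), key=lambda r: dist(base, r)
        )[:k]
        other = rng.choice(neighbours)
        t = rng.random()  # interpolation weight in [0, 1)
        synthetic.append(
            [
                b + t * (o - b) + rng.gauss(0.0, noise_sd * s)
                for b, o, s in zip(base, other, sds)
            ]
        )
    return synthetic


# Hypothetical toy rows: [age, a blood test value, MMSE score].
measured = [
    [62.0, 4.1, 29.0],
    [70.0, 3.9, 27.0],
    [75.0, 3.8, 26.0],
    [81.0, 3.5, 24.0],
    [90.0, 3.2, 20.0],
]
synthetic = augment_with_interpolation(measured, n_new=5)
noiseless = augment_with_interpolation(measured, n_new=20, noise_sd=0.0)
```

In practice, synthetic values would also need to be clipped to valid ranges (for example, MMSE scores lie between 0 and 30), and a GAN or CTGAN would replace the interpolation step with a learned generator.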
This study used measured data from 108 participants together with synthesized data generated from those measurements. External validity was evaluated on a separate dataset of 41 participants that contained missing items. The results suggested that standard GANs are advantageous for exploring models across diverse data, whereas CTGANs preserve the structure and linear relationships of the measured tabular data, which markedly improves linear regression models.
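As a rough illustration of what evaluating on an external dataset with missing items can involve, the sketch below fits a one-feature least-squares line on training data, fills missing external values with the training mean, and reports the mean absolute error. The abstract does not state how missing items were handled or which error metric was used, so mean imputation and MAE are assumptions here, and all names and numbers are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one feature; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def impute_with(values, fill):
    """Replace missing entries (None) with a fixed fill value."""
    return [fill if v is None else v for v in values]


def mean_absolute_error(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical training data (measured plus synthesized rows would go here).
train_age = [60.0, 70.0, 80.0, 90.0]
train_mmse = [29.0, 27.0, 25.0, 23.0]

# Hypothetical external data; None marks an item that was not recorded.
external_age = [65.0, None, 85.0]
external_mmse = [27.0, 26.0, 25.0]

slope, intercept = fit_line(train_age, train_mmse)

# Assumption: missing inputs are filled with the training-set mean.
train_mean_age = sum(train_age) / len(train_age)
external_age_filled = impute_with(external_age, train_mean_age)

predictions = [slope * a + intercept for a in external_age_filled]
mae = mean_absolute_error(external_mmse, predictions)
```

Imputing from training statistics rather than from the external set keeps the external data unseen during model building, which is the point of an external validity check.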
Importantly, because the synthesized data interpolate sparse regions of the feature distributions, such as age, models trained with them maintained prediction accuracy on test data with extreme inputs. These findings suggest that GAN-synthesized data can effectively address regression problems and improve dementia risk prediction.