A computational pipeline for data augmentation towards the improvement of disease classification and risk stratification models: A case study in two clinical domains.

Author information

Pezoulas Vasileios C, Grigoriadis Grigoris I, Gkois George, Tachos Nikolaos S, Smole Tim, Bosnić Zoran, Pičulin Matej, Olivotto Iacopo, Barlocco Fausto, Robnik-Šikonja Marko, Jakovljevic Djordje G, Goules Andreas, Tzioufas Athanasios G, Fotiadis Dimitrios I

Affiliations

Unit of Medical Technology and Intelligent Information Systems, Department of Materials Science and Engineering, University of Ioannina, Ioannina, GR45110, Greece.

Faculty of Computer and Information Science, University of Ljubljana, Večna Pot 113, 1000, Ljubljana, Slovenia.

Publication information

Comput Biol Med. 2021 Jul;134:104520. doi: 10.1016/j.compbiomed.2021.104520. Epub 2021 Jun 6.

DOI:10.1016/j.compbiomed.2021.104520

PMID:34118751

Abstract

Virtual population generation is an emerging field in data science with numerous healthcare applications, chiefly the augmentation of clinical research databases that suffer from small population sizes. However, the impact of data augmentation on the development of artificial intelligence (AI) models that address unmet clinical needs has not yet been investigated. In this work, we assess, for the first time in the literature, whether aggregating real with virtual patient data can improve the performance of existing risk stratification and disease classification models in two rare clinical domains: primary Sjögren's Syndrome (pSS) and hypertrophic cardiomyopathy (HCM). To do so, multivariate approaches, such as the multivariate normal distribution (MVND), and straightforward ones, such as Bayesian networks, artificial neural networks (ANNs), and tree ensembles, are compared in terms of their ability to generate high-quality virtual data. Both boosting and bagging algorithms, such as gradient boosting trees (XGBoost), AdaBoost, and Random Forests (RFs), were trained on the augmented data to evaluate the performance improvement for lymphoma classification and HCM risk stratification. Our results revealed the favorable performance of the tree ensemble generators in both domains, yielding virtual data with a goodness-of-fit of 0.021 and a KL-divergence of 0.029 in pSS, and 0.029 and 0.027, respectively, in HCM. Applying XGBoost to the augmented data increased accuracy by 10.9%, sensitivity by 10.7%, and specificity by 11.5% for lymphoma classification, and accuracy by 16.1%, sensitivity by 16.9%, and specificity by 13.7% for HCM risk stratification.
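The generate-then-evaluate idea described in the abstract can be illustrated with a minimal sketch. It uses the MVND generator only because it is the simplest of the generators named above (the abstract reports that tree ensembles performed best), and it is not the authors' implementation. The toy two-feature data, the function names, the histogram-based KL estimate, and the bin count are all assumptions made for illustration; the abstract does not define how goodness-of-fit or KL-divergence were computed.

```python
import math
import random

def mean_and_cov(rows):
    """Sample mean vector and unbiased covariance matrix (rows = patients)."""
    n, d = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(d)]
    cov = [[sum((r[i] - mean[i]) * (r[j] - mean[j]) for r in rows) / (n - 1)
            for j in range(d)] for i in range(d)]
    return mean, cov

def cholesky(a):
    """Lower-triangular L with L * L^T == a (a must be positive definite)."""
    d = len(a)
    L = [[0.0] * d for _ in range(d)]
    for i in range(d):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(a[i][i] - s)
            else:
                L[i][j] = (a[i][j] - s) / L[j][j]
    return L

def generate_virtual(rows, n_virtual, seed=0):
    """Draw virtual records from an MVND fitted to the real records."""
    rng = random.Random(seed)
    mean, cov = mean_and_cov(rows)
    L = cholesky(cov)
    d = len(mean)
    virtual = []
    for _ in range(n_virtual):
        z = [rng.gauss(0.0, 1.0) for _ in range(d)]
        # x = mean + L z has covariance L L^T = cov
        virtual.append([mean[i] + sum(L[i][k] * z[k] for k in range(i + 1))
                        for i in range(d)])
    return virtual

def kl_divergence(p_samples, q_samples, bins=10, eps=1e-9):
    """Histogram estimate of KL(P || Q) for one feature (assumed estimator)."""
    lo = min(min(p_samples), min(q_samples))
    hi = max(max(p_samples), max(q_samples))
    width = (hi - lo) / bins or 1.0

    def hist(xs):
        counts = [eps] * bins  # eps avoids log(0) for empty bins
        for x in xs:
            counts[min(int((x - lo) / width), bins - 1)] += 1
        total = sum(counts)
        return [c / total for c in counts]

    p, q = hist(p_samples), hist(q_samples)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

# Toy "real" cohort: two correlated features (synthetic, not clinical data).
rng = random.Random(42)
real = []
for _ in range(200):
    x = rng.gauss(0.0, 1.0)
    real.append([x, 0.5 * x + rng.gauss(0.0, 1.0)])

virtual = generate_virtual(real, n_virtual=1000)
augmented = real + virtual  # real and virtual records aggregated for training
kl_per_feature = [kl_divergence([r[j] for r in real], [v[j] for v in virtual])
                  for j in range(2)]
print(kl_per_feature)
```

In a setting like the one the abstract describes, the `augmented` records would then be passed to a classifier such as XGBoost, AdaBoost, or a Random Forest, with performance compared against training on `real` alone.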

Similar articles

Variational Gaussian Mixture Models with robust Dirichlet concentration priors for virtual population generation in hypertrophic cardiomyopathy: a comparison study.

Annu Int Conf IEEE Eng Med Biol Soc. 2021 Nov;2021:1674-1677. doi: 10.1109/EMBC46164.2021.9629653.

A federated AI strategy for the classification of patients with Mucosa Associated Lymphoma Tissue (MALT) lymphoma across multiple harmonized cohorts.

Annu Int Conf IEEE Eng Med Biol Soc. 2021 Nov;2021:1666-1669. doi: 10.1109/EMBC46164.2021.9630014.

Classification of genomic islands using decision trees and their ensemble algorithms.

BMC Genomics. 2010 Nov 2;11 Suppl 2(Suppl 2):S1. doi: 10.1186/1471-2164-11-S2-S1.

Learning ensembles of neural networks by means of a Bayesian artificial immune system.

IEEE Trans Neural Netw. 2011 Feb;22(2):304-16. doi: 10.1109/TNN.2010.2096823. Epub 2010 Dec 23.

A Theoretical Analysis of Why Hybrid Ensembles Work.

Comput Intell Neurosci. 2017;2017:1930702. doi: 10.1155/2017/1930702. Epub 2017 Jan 31.

Artificial Intelligence Algorithms to Diagnose Glaucoma and Detect Glaucoma Progression: Translation to Clinical Practice.

Transl Vis Sci Technol. 2020 Oct 15;9(2):55. doi: 10.1167/tvst.9.2.55. eCollection 2020 Oct.

Generation of virtual patient data for in-silico cardiomyopathies drug development using tree ensembles: a comparative study.

Annu Int Conf IEEE Eng Med Biol Soc. 2020 Jul;2020:5343-5346. doi: 10.1109/EMBC44109.2020.9176567.

BgN-Score and BsN-Score: bagging and boosting based ensemble neural networks scoring functions for accurate binding affinity prediction of protein-ligand complexes.

BMC Bioinformatics. 2015;16 Suppl 4(Suppl 4):S8. doi: 10.1186/1471-2105-16-S4-S8. Epub 2015 Feb 23.

Addressing the clinical unmet needs in primary Sjögren's Syndrome through the sharing, harmonization and federated analysis of 21 European cohorts.

Comput Struct Biotechnol J. 2022 Jan 7;20:471-484. doi: 10.1016/j.csbj.2022.01.002. eCollection 2022.

Cited by

Synthetic data generation methods in healthcare: A review on open-source tools and methods.

Comput Struct Biotechnol J. 2024 Jul 9;23:2892-2910. doi: 10.1016/j.csbj.2024.07.005. eCollection 2024 Dec.

CADUCEO: A Platform to Support Federated Healthcare Facilities through Artificial Intelligence.

Healthcare (Basel). 2023 Aug 4;11(15):2199. doi: 10.3390/healthcare11152199.

A practical solution to estimate the sample size required for clinical prediction models generated from observational research on data.

Eur Radiol Exp. 2022 Jun 1;6(1):22. doi: 10.1186/s41747-022-00276-y.
