Compressive Big Data Analytics: An ensemble meta-algorithm for high-dimensional multisource datasets.

Authors

Marino Simeone, Zhao Yi, Zhou Nina, Zhou Yiwang, Toga Arthur W, Zhao Lu, Jian Yingsi, Yang Yichen, Chen Yehu, Wu Qiucheng, Wild Jessica, Cummings Brandon, Dinov Ivo D

Affiliations

Statistics Online Computational Resource, Department of Health Behavior and Biological Sciences, University of Michigan, Ann Arbor, Michigan, United States of America.

Department of Microbiology and Immunology, University of Michigan, Ann Arbor, Michigan, United States of America.

Publication information

PLoS One. 2020 Aug 28;15(8):e0228520. doi: 10.1371/journal.pone.0228520. eCollection 2020.

Abstract

Health advances are contingent on continuous development of new methods and approaches to foster data-driven discovery in the biomedical and clinical sciences. Open-science and team-based scientific discovery offer hope for tackling some of the difficult challenges associated with managing, modeling, and interpreting large, complex, and multisource data. Translating raw observations into useful information and actionable knowledge depends on effective domain-independent reproducibility, area-specific replicability, data curation, analysis protocols, organization, management, and sharing of health-related digital objects. This study expands the functionality and utility of an ensemble semi-supervised machine learning technique called Compressive Big Data Analytics (CBDA). Applied to high-dimensional data, CBDA (1) identifies salient features and key biomarkers enabling reliable and reproducible forecasting of binary, multinomial, and continuous outcomes (i.e., feature mining); and (2) suggests the most accurate algorithms/models for predictive analytics of the observed data (i.e., model mining). The method relies on iterative subsampling, combines function optimization and statistical inference, and generates ensemble predictions for observed univariate outcomes. The novelty of this study is highlighted by a new and expanded set of CBDA features including (1) efficiently handling extremely large datasets (>100,000 cases and >1,000 features); (2) generalizing the internal and external validation steps; (3) expanding the set of base-learners for joint ensemble prediction; (4) introducing an automated selection of CBDA specifications; and (5) providing mechanisms to assess CBDA convergence, evaluate the prediction accuracy, and measure result consistency. To ground the mathematical model and the corresponding computational algorithm, CBDA 2.0 validation utilizes synthetic datasets as well as a population-wide census-like study.
Specifically, an empirical validation of the CBDA technique is based on translational health research using a large-scale clinical study (UK Biobank), which includes imaging, cognitive, and clinical assessment data. The UK Biobank archive presents several difficult challenges related to the aggregation, harmonization, modeling, and interrogation of the information. These problems are related to the complex longitudinal structure, variable heterogeneity, feature multicollinearity, incongruency, and missingness, as well as violations of classical parametric assumptions. Our results show the scalability, efficiency, and usability of CBDA to transform complex data into structural information leading to derived knowledge and translational action. Applying CBDA 2.0 to the UK Biobank case-study allows predicting various outcomes of interest, e.g., mood disorders and irritability, and suggests new and exciting avenues of evidence-based research in the context of identifying, tracking, and treating mental health and aging-related diseases. Following open-science principles, we share the entire end-to-end protocol, source-code, and results. This facilitates independent validation, result reproducibility, and team-based collaborative discovery.
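
The feature-mining idea summarized in the abstract (iterative subsampling of cases and features, scoring each subsample on validation data, then ranking features by how often they appear in the best-scoring subsamples) can be illustrated with a small sketch. This is not the authors' implementation: the nearest-centroid classifier is a deliberately simple stand-in for the CBDA ensemble of base-learners, and every name and setting here (`make_data`, `cbda_sketch`, `case_rate`, `feat_rate`, `top`, the synthetic data) is a hypothetical choice made only for illustration.

```python
import random
from collections import Counter

def make_data(n_cases, n_features, signal, rng):
    """Synthetic data: the binary outcome depends only on the `signal` features."""
    X = [[rng.gauss(0, 1) for _ in range(n_features)] for _ in range(n_cases)]
    y = [1 if sum(row[j] for j in signal) > 0 else 0 for row in X]
    return X, y

def centroid_fit(X, y, feats):
    """Stand-in base learner (assumption): per-class mean of the chosen features."""
    cents = {}
    for label in (0, 1):
        rows = [x for x, t in zip(X, y) if t == label]
        cents[label] = [sum(r[j] for r in rows) / max(len(rows), 1) for j in feats]
    return cents

def centroid_predict(cents, x, feats):
    """Predict the class whose centroid is closest in squared Euclidean distance."""
    def dist(c):
        return sum((x[j] - cj) ** 2 for j, cj in zip(feats, c))
    return min(cents, key=lambda label: dist(cents[label]))

def cbda_sketch(X, y, X_val, y_val, n_iter=1000, case_rate=0.3,
                feat_rate=0.15, top=100, seed=0):
    """Rank features by their frequency in the `top` most accurate subsamples."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    results = []
    for _ in range(n_iter):
        # Draw a random subset of cases and a random subset of features.
        rows = rng.sample(range(n), int(case_rate * n))
        feats = rng.sample(range(p), max(1, int(feat_rate * p)))
        cents = centroid_fit([X[i] for i in rows], [y[i] for i in rows], feats)
        # Score this subsample's model on the separate validation set.
        hits = sum(centroid_predict(cents, x, feats) == t
                   for x, t in zip(X_val, y_val))
        results.append((hits / len(y_val), feats))
    # Keep the best-scoring subsamples and count how often each feature occurs.
    results.sort(key=lambda r: r[0], reverse=True)
    counts = Counter(j for _, feats in results[:top] for j in feats)
    return counts.most_common()

rng = random.Random(42)
signal = [0, 1, 2]                      # features that actually drive the outcome
X, y = make_data(400, 30, signal, rng)
X_val, y_val = make_data(200, 30, signal, rng)
ranking = cbda_sketch(X, y, X_val, y_val)
print(ranking[:5])                      # (feature index, count) pairs, most frequent first
```

In this toy setting, subsamples that happen to include the signal features tend to predict better, so those features should accumulate the highest counts. The model-mining step described in the abstract would additionally compare several base-learners per subsample, which this sketch omits.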

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/301d/7455041/e887d473f828/pone.0228520.g001.jpg
