Peter V. Coveney, Edward R. Dougherty, Roger R. Highfield
Centre for Computational Science, University College London, Gordon Street, London WC1H 0AJ, UK
Center for Bioinformatics and Genomic Systems Engineering, Texas A&M University, College Station, TX 77843-3128, USA
Philos Trans A Math Phys Eng Sci. 2016 Nov 13;374(2080). doi: 10.1098/rsta.2016.0153.
The current interest in big data, machine learning and data analytics has generated the widespread impression that such methods are capable of solving most problems without the need for conventional scientific methods of inquiry. Interest in these methods is intensifying, accelerated by the ease with which digitized data can be acquired in virtually all fields of endeavour, from science, healthcare and cybersecurity to economics, social sciences and the humanities. In multiscale modelling, machine learning appears to provide a shortcut to reveal correlations of arbitrary complexity between processes at the atomic, molecular, meso- and macroscales. Here, we point out the weaknesses of pure big data approaches with particular focus on biology and medicine, which fail to provide conceptual accounts for the processes to which they are applied. No matter their 'depth' and the sophistication of data-driven methods, such as artificial neural nets, in the end they merely fit curves to existing data. Not only do these methods invariably require far larger quantities of data than anticipated by big data aficionados in order to produce statistically reliable results, but they can also fail in circumstances beyond the range of the data used to train them because they are not designed to model the structural characteristics of the underlying system. We argue that it is vital to use theory as a guide to experimental design for maximal efficiency of data collection and to produce reliable predictive models and conceptual knowledge. Rather than continuing to fund, pursue and promote 'blind' big data projects with massive budgets, we call for more funding to be allocated to the elucidation of the multiscale and stochastic processes controlling the behaviour of complex systems, including those of life, medicine and healthcare. This article is part of the themed issue 'Multiscale modelling at the physics-chemistry-biology interface'.
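As a purely illustrative aside (not part of the paper), the abstract's central technical claim, that data-driven curve fitting can succeed within the range of its training data yet fail badly outside it because it encodes no structural knowledge of the underlying process, can be sketched in a few lines of Python. The "true" exponential process, the polynomial surrogate and the sample sizes below are all hypothetical choices standing in for a mechanistic model and a flexible data-driven fit such as a neural network.

# Minimal, hypothetical sketch of extrapolation failure in pure curve fitting.
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical underlying process: exponential relaxation y = exp(-x),
# standing in for a theory-based (mechanistic) model of the system.
def true_process(x):
    return np.exp(-x)

# "Big data" regime: many noisy observations, but only over x in [0, 2].
x_train = rng.uniform(0.0, 2.0, 500)
y_train = true_process(x_train) + rng.normal(0.0, 0.02, x_train.size)

# Data-driven surrogate: a high-degree polynomial fit, a stand-in for any
# flexible curve-fitting method trained only on the observed range.
coeffs = np.polyfit(x_train, y_train, deg=9)
surrogate = np.poly1d(coeffs)

# Inside the training range the fit looks excellent ...
x_in = np.linspace(0.0, 2.0, 5)
print("max error inside  [0, 2]:", np.abs(surrogate(x_in) - true_process(x_in)).max())

# ... but outside it the prediction diverges, because the surrogate captures
# no structural characteristics of the process it was fitted to.
x_out = np.linspace(3.0, 5.0, 5)
print("max error outside [3, 5]:", np.abs(surrogate(x_out) - true_process(x_out)).max())

Running the sketch shows small errors on the training interval and rapidly growing errors beyond it, which is the behaviour the authors contrast with theory-guided, mechanistic multiscale models.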