Cavendish Laboratory, Department of Physics, University of Cambridge, J. J. Thomson Avenue, Cambridge CB3 0HE, U.K.
ISIS Neutron and Muon Source, STFC Rutherford Appleton Laboratory, Harwell Science and Innovation Campus, Didcot, Oxfordshire OX11 0QX, U.K.
Acc Chem Res. 2020 Mar 17;53(3):599-610. doi: 10.1021/acs.accounts.9b00470. Epub 2020 Feb 25.
The world needs new materials to stimulate the chemical industry in key sectors of our economy: environment and sustainability, information storage, optical telecommunications, and catalysis. Yet nearly all functional materials are still discovered by "trial-and-error", and this lack of predictability creates a major materials bottleneck for technological innovation. The average "molecule-to-market" lead time for materials discovery is currently 20 years. This is far too long for industrial needs, as highlighted by the Materials Genome Initiative, which has ambitious targets of up to 4-fold reductions in average molecule-to-market lead times. Such a large step change can realistically be achieved only by adopting an entirely new approach to materials discovery. Fortunately, such an approach has been emerging, whereby data science combined with artificial intelligence offers a prospective way to shorten these average molecule-to-market lead times.
This approach is known as data-driven materials discovery. Its broad prospects have only recently become a reality, given the timely and major advances in "big data", artificial intelligence, and high-performance computing (HPC). Access to massive data sets has been stimulated by government-regulated open-access requirements for data and literature. Natural-language processing (NLP) and machine-learning (ML) tools that can mine data and find patterns therein are becoming mainstream. Exascale HPC capabilities, which can aid data mining and pattern recognition and also generate their own data from calculations, are now within our grasp. These timely advances present an ideal opportunity to develop data-driven materials-discovery strategies that systematically design and predict new chemicals for a given device application.
This Account shows how data science can enable materials discovery via a four-step "design-to-device" pipeline that entails (1) data extraction, (2) data enrichment, (3) material prediction, and (4) experimental validation. Massive databases of cognate chemical and property information are first forged using "chemistry-aware" natural-language-processing tools, such as ChemDataExtractor, and are then enriched using machine-learning methods and high-throughput quantum-chemical calculations. New materials for a bespoke application can then be predicted by mining these databases with algorithmic encodings of the relationships between chemical structures and physical properties that are known to deliver functional materials. These encodings may take the form of classification, enumeration, or machine-learning algorithms. A data-mining workflow short-lists these predictions to a handful of lead candidate materials that go forward to experimental validation. This design-to-device approach is being developed to offer a roadmap for the accelerated discovery of new chemicals for functional applications. The case studies presented demonstrate its utility for photovoltaic, optical, and catalytic applications. While this Account focuses on applications in the physical sciences, the generic pipeline discussed is readily transferable to other scientific disciplines such as biology and medicine.
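The four-step pipeline can be illustrated with a deliberately small Python sketch. It is not the authors' implementation and does not use ChemDataExtractor. The compound names, wavelengths, screening window, and all function names are hypothetical placeholders. The enrichment step is only a stand-in: it converts an absorption wavelength to a photon energy via E = hc/λ (hc ≈ 1239.84 eV·nm), in place of the machine-learning and quantum-chemical enrichment described above.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Record:
    name: str
    absorption_max_nm: Optional[float]      # step 1: value mined from text, if any
    optical_gap_ev: Optional[float] = None  # step 2: filled in by enrichment


def extract(raw_rows: List[Tuple[str, Optional[float]]]) -> List[Record]:
    """Step 1 (data extraction): wrap raw (name, value) pairs as records."""
    return [Record(name, value) for name, value in raw_rows]


def enrich(records: List[Record]) -> List[Record]:
    """Step 2 (data enrichment): add a derived property where possible.

    Stand-in only: photon energy E = hc / lambda, with hc = 1239.84 eV nm.
    """
    for r in records:
        if r.absorption_max_nm is not None:
            r.optical_gap_ev = 1239.84 / r.absorption_max_nm
    return records


def predict(records: List[Record], low_ev: float, high_ev: float) -> List[Record]:
    """Step 3 (material prediction): a classification-style rule that keeps
    records whose derived gap lies in a target window, smallest gap first."""
    hits = [
        r for r in records
        if r.optical_gap_ev is not None and low_ev <= r.optical_gap_ev <= high_ev
    ]
    return sorted(hits, key=lambda r: r.optical_gap_ev)


def shortlist(candidates: List[Record], n: int) -> List[Record]:
    """Short-list the top n candidates to hand over to step 4,
    experimental validation, which happens in the laboratory."""
    return candidates[:n]


# Hypothetical placeholder data, not taken from the Account.
raw = [
    ("compound_A", 450.0),
    ("compound_B", 620.0),
    ("compound_C", None),   # property missing from the mined text
    ("compound_D", 800.0),
]

leads = shortlist(predict(enrich(extract(raw)), low_ev=1.5, high_ev=2.5), n=2)
for lead in leads:
    print(lead.name, round(lead.optical_gap_ev, 2))
```

In a real workflow each stage would be far richer, but the shape is the same: records flow through extraction, enrichment, and a screening rule, and only a handful of leads reach the bench.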