Materials Science Division, Lawrence Livermore National Laboratory, 7000 East Avenue, Livermore, California 94550, United States.
Center for Applied Scientific Computing, Lawrence Livermore National Laboratory, 7000 East Avenue, Livermore, California 94550, United States.
J Chem Inf Model. 2020 Jun 22;60(6):2876-2887. doi: 10.1021/acs.jcim.0c00199. Epub 2020 Apr 29.
Nanomaterials of varying compositions and morphologies are of interest for many applications, from catalysis to optics, but the synthesis of nanomaterials and their scale-up are most often time-consuming, Edisonian processes. Information gleaned from the scientific literature can help inform and accelerate nanomaterials development, but searching the literature and digesting the information are themselves time-consuming manual processes for researchers. To help address these challenges, we developed scientific article-processing tools that extract and structure information from the text and figures of nanomaterials articles, thereby enabling the creation of a personalized knowledge base for nanomaterials synthesis that can be mined to help inform further nanomaterials development. Starting with a corpus of ∼35k nanomaterials-related articles, we developed models to classify articles according to nanomaterial composition and morphology, extract synthesis protocols from the articles' text, and extract, normalize, and categorize chemical terms within synthesis protocols. We demonstrate the effectiveness of the proposed pipeline on an expert-labeled set of nanomaterials synthesis articles, achieving 100% accuracy on composition prediction, 95% accuracy on morphology prediction, 0.99 AUC on protocol identification, and up to a 0.87 F1-score on chemical entity recognition. In addition to the articles' text, microscopy images of nanomaterials within the articles are automatically identified and analyzed to determine the nanomaterials' morphologies and size distributions. To enable users to easily explore the database, we developed a complementary browser-based visualization tool that provides flexibility in comparing subsets of articles of interest.
We use these tools and information to identify trends in nanomaterials synthesis, such as the correlation of certain reagents with various nanomaterial morphologies, which is useful in guiding hypotheses and reducing the potential parameter space during experimental design.
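As a rough illustration of the kind of trend analysis described above, the sketch below tabulates how often each reagent appears in protocols for each morphology. It is not the authors' implementation: the record layout, the function name `reagent_fraction_by_morphology`, and the placeholder reagent and morphology labels are all assumptions made for this example, and the values are invented rather than taken from the paper.

```python
from collections import Counter, defaultdict

# Hypothetical structured records, one per article:
# (predicted morphology, set of normalized reagents found in its protocol).
# Placeholder labels only; these are NOT results from the paper.
records = [
    ("rod", {"reagent_A", "reagent_B", "reagent_C"}),
    ("rod", {"reagent_A", "reagent_B"}),
    ("sphere", {"reagent_B", "reagent_D"}),
    ("sphere", {"reagent_B", "reagent_D"}),
    ("cube", {"reagent_C", "reagent_E"}),
]


def reagent_fraction_by_morphology(records):
    """Return {morphology: {reagent: fraction of articles mentioning it}}."""
    totals = Counter()               # number of articles per morphology
    counts = defaultdict(Counter)    # reagent mentions per morphology
    for morphology, reagents in records:
        totals[morphology] += 1
        counts[morphology].update(reagents)  # each reagent counted once per article
    return {
        m: {r: n / totals[m] for r, n in counts[m].items()}
        for m in totals
    }


fractions = reagent_fraction_by_morphology(records)
for morphology, table in sorted(fractions.items()):
    for reagent, frac in sorted(table.items()):
        print(f"{morphology:8s} {reagent:10s} {frac:.2f}")
```

A reagent that appears in most protocols for one morphology and in few for the others is a candidate for closer study. A table like this shows co-occurrence only, not causation, which is consistent with its stated role of guiding hypotheses.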