Venugopal Vineeth, Sahoo Sourav, Zaki Mohd, Agarwal Manish, Gosvami Nitya Nand, Krishnan N M Anoop
Department of Civil Engineering, Indian Institute of Technology Delhi, Hauz Khas, New Delhi 110016, India.
Department of Materials Science and Engineering, Indian Institute of Technology Delhi, Hauz Khas, New Delhi 110016, India.
Patterns (N Y). 2021 Jun 24;2(7):100290. doi: 10.1016/j.patter.2021.100290. eCollection 2021 Jul 9.
Most of the knowledge in the materials science literature exists as unstructured data such as text and images. Here, we present a framework that uses natural language processing to automate text and image comprehension and precise knowledge extraction from the literature on inorganic glasses. Abstracts are automatically categorized using latent Dirichlet allocation (LDA), making it possible to classify and search semantically linked publications. Similarly, a comprehensive summary of images and plots is presented using the caption cluster plot (CCP), which provides direct access to images buried in the papers. Finally, we combine LDA and CCP with chemical elements to present an elemental map, a topic-wise and image-wise distribution of the elements occurring in the literature. Overall, the framework can serve as a generic and powerful tool to extract and disseminate material-specific information on composition-structure-processing-property dataspaces, enabling insights into fundamental problems relevant to the materials science community and accelerating materials discovery.
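The elemental map described in the abstract can be illustrated with a minimal sketch. It assumes each abstract already carries a topic label (for example, the dominant topic from an LDA model) and simply counts, per topic, how many documents mention each element in a chemical formula. The element list, the formula-matching heuristic, the function names `elements_in` and `elemental_map`, and the sample documents are all hypothetical and chosen for illustration; this is not the authors' implementation.

```python
import re
from collections import Counter, defaultdict

# Hypothetical, deliberately short list of element symbols (assumption);
# a real map would cover the whole periodic table.
ELEMENTS = {"Si", "O", "Na", "B", "Ca", "Al", "Ge", "P"}

# One element-like token: a capital letter plus an optional lowercase letter.
SYMBOL = re.compile(r"[A-Z][a-z]?")
# A word that consists only of symbol-plus-digits units, e.g. "SiO2" or "Al2O3".
FORMULA = re.compile(r"(?:[A-Z][a-z]?\d*)+")


def elements_in(text):
    """Return the set of known element symbols found in formula-like words."""
    found = set()
    for word in re.findall(r"[A-Za-z0-9]+", text):
        if FORMULA.fullmatch(word):
            symbols = SYMBOL.findall(word)
            # Accept the word only if every token is a known element,
            # which filters out acronyms such as "NMR".
            if all(s in ELEMENTS for s in symbols):
                found.update(symbols)
    return found


def elemental_map(labelled_docs):
    """Count, for each topic, the number of documents mentioning each element."""
    counts = defaultdict(Counter)
    for topic, text in labelled_docs:
        counts[topic].update(elements_in(text))
    return counts


# Made-up (topic, text) pairs; in practice the labels would come from a topic model.
docs = [
    ("silicate", "Structure of Na2O-SiO2 glasses studied by NMR."),
    ("silicate", "Dissolution of CaO-Al2O3-SiO2 glass."),
    ("borate", "Optical properties of B2O3-Na2O glasses."),
]

for topic, counter in elemental_map(docs).items():
    print(topic, dict(counter))
```

Counting documents rather than raw mentions keeps a single formula-heavy abstract from dominating a topic; the same grouping could be keyed on caption clusters instead of topics to obtain the image-wise view.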