Department of Chemical Engineering, Massachusetts Institute of Technology, 77 Massachusetts Avenue, Cambridge, Massachusetts 02139, United States.
Materials Science and Engineering Division, National Institute of Standards and Technology, 100 Bureau Drive, Gaithersburg, Maryland 20899, United States.
J Chem Inf Model. 2024 Aug 26;64(16):6464-6476. doi: 10.1021/acs.jcim.4c00242. Epub 2024 Aug 10.
The Block Copolymer Database (BCDB) is a platform that allows users to search, submit, visualize, benchmark, and download experimental phase measurements and their associated characterization information for di- and multiblock copolymers. To the best of our knowledge, there is no widely accepted data model for publishing experimental and simulation data on block copolymer self-assembly. The proposed data schema carries traceable information and can accommodate any number of blocks. At the time of publication, the database contains over 5400 block copolymer melt phase measurements in total, mined from the literature and manually curated, together with simulated phase diagram data points generated from self-consistent field theory, which can be rapidly augmented. The database can be accessed via the Community Resource for Innovation in Polymer Technology (CRIPT) web application and the Materials Data Facility. The chemical structure of each polymer is encoded in BigSMILES, an extension of the Simplified Molecular-Input Line-Entry System (SMILES) into the macromolecular domain, and users can search for repeat units and functional groups using the SMARTS (SMILES Arbitrary Target Specification) search syntax. Users can also query characterization and phase information using Structured Query Language (SQL) and download custom sets of block copolymer data to train machine learning models. Finally, a protocol is presented in which GPT-4, an AI-powered large language model, rapidly screens the literature using only the abstract text to identify block copolymer papers and determine whether they contain BCDB data, allowing the database to grow as the number of published papers on the World Wide Web increases. The F1 score for this model is 0.74. This platform is an important step in making polymer data more accessible to the broader community.
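As a rough illustration of the kind of SQL query over phase and characterization data described above, the following minimal sketch uses Python's built-in `sqlite3` module with an in-memory table. The table name `measurements`, its columns, the helper functions, and all row values are hypothetical placeholders chosen for this example; they are not the actual BCDB schema or data, and the real database is accessed through CRIPT or the Materials Data Facility rather than a local SQLite file.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory table with made-up rows (NOT real BCDB data)."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical schema: column names are assumptions for illustration only.
    conn.execute(
        "CREATE TABLE measurements ("
        "id INTEGER PRIMARY KEY, polymer TEXT, phase TEXT, "
        "temperature_c REAL, volume_fraction_a REAL)"
    )
    placeholder_rows = [
        ("PS-b-PI", "lamellar", 120.0, 0.50),
        ("PS-b-PI", "cylinder", 120.0, 0.30),
        ("PS-b-PMMA", "lamellar", 180.0, 0.48),
        ("PS-b-PMMA", "disordered", 220.0, 0.48),
    ]
    conn.executemany(
        "INSERT INTO measurements "
        "(polymer, phase, temperature_c, volume_fraction_a) "
        "VALUES (?, ?, ?, ?)",
        placeholder_rows,
    )
    return conn


def query_phase(conn, phase, t_min, t_max):
    """Return (polymer, temperature, volume fraction) rows for one phase
    observed within a temperature window, ordered by temperature."""
    cursor = conn.execute(
        "SELECT polymer, temperature_c, volume_fraction_a "
        "FROM measurements "
        "WHERE phase = ? AND temperature_c BETWEEN ? AND ? "
        "ORDER BY temperature_c",
        (phase, t_min, t_max),
    )
    return cursor.fetchall()


if __name__ == "__main__":
    demo = build_demo_db()
    for row in query_phase(demo, "lamellar", 100.0, 200.0):
        print(row)
```

Parameterized queries (`?` placeholders) are used so that a filter value is never interpolated into the SQL string; the rows returned by such a query could then be exported as a custom training set for a machine learning model.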