Harvey Matthew J, Mason Nicholas J, McLean Andrew, Murray-Rust Peter, Rzepa Henry S, Stewart James J P
High Performance Computing Service, Imperial College London, London, SW7 2AZ UK.
Department of Chemistry, Imperial College London, South Kensington Campus, London, SW7 2AZ UK.
J Cheminform. 2015 Aug 27;7:43. doi: 10.1186/s13321-015-0093-3. eCollection 2015.
We report the curation of 158,122 molecular geometries derived from the NCI set of reference molecules, together with associated properties computed using the MOPAC semi-empirical quantum mechanical method. The data were originally deposited in 2005 into the Cambridge DSpace repository as a data collection.
The curation involved three procedures: annotating the original data using new MOPAC methods; updating the syntax of the CML documents used to express the data to ensure schema conformance; and adding new metadata describing the entries, together with an XML schema transformation that maps the metadata schema to that used by the DataCite organisation. We have adopted a granularity model in which a DataCite persistent identifier (DOI) is created for each individual molecule, enabling data discovery and data metrics at this level using DataCite tools.
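The per-molecule mapping to DataCite metadata can be illustrated with a minimal sketch. This is not the authors' transformation, which the abstract describes only as an XML schema transformation. It is a Python stand-in that builds a record containing the mandatory DataCite properties for one molecule. The namespace version (kernel-4), the function name `to_datacite`, the input dictionary keys, and every value in `example` (including the DOI under a test prefix) are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Assumed schema namespace; the version used in the original work may differ.
DATACITE_NS = "http://datacite.org/schema/kernel-4"
ET.register_namespace("", DATACITE_NS)


def q(tag: str) -> str:
    """Return the namespace-qualified form of a DataCite element name."""
    return f"{{{DATACITE_NS}}}{tag}"


def to_datacite(entry: dict) -> ET.Element:
    """Map one molecule's metadata (hypothetical keys) to a DataCite record."""
    root = ET.Element(q("resource"))

    # One persistent identifier per molecule, as in the granularity model.
    ident = ET.SubElement(root, q("identifier"), identifierType="DOI")
    ident.text = entry["doi"]

    creators = ET.SubElement(root, q("creators"))
    for name in entry["creators"]:
        creator = ET.SubElement(creators, q("creator"))
        ET.SubElement(creator, q("creatorName")).text = name

    titles = ET.SubElement(root, q("titles"))
    ET.SubElement(titles, q("title")).text = entry["title"]

    ET.SubElement(root, q("publisher")).text = entry["publisher"]
    ET.SubElement(root, q("publicationYear")).text = str(entry["year"])

    resource_type = ET.SubElement(
        root, q("resourceType"), resourceTypeGeneral="Dataset"
    )
    resource_type.text = "Molecular geometry"
    return root


# Hypothetical entry; none of these values come from the actual collection.
example = {
    "doi": "10.5555/example-0001",
    "creators": ["Example, Author"],
    "title": "MOPAC-computed geometry for example molecule 0001",
    "publisher": "Example Repository",
    "year": 2015,
}

record = to_datacite(example)
xml_text = ET.tostring(record, encoding="unicode")
```

Applying such a function to each entry yields one metadata record per molecule, which is the level at which discovery and metrics are intended to operate.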
We recommend that the future research data management (RDM) of the scientific and chemical data components associated with journal articles (the "supporting information") should be conducted in a manner that facilitates automatic periodic curation.
Graphical abstract: Standards and metadata-based curation of a decade-old digital repository dataset of molecular information.