Department of Life Science Informatics, B-IT, LIMES Program Unit Chemical Biology and Medicinal Chemistry, Rheinische Friedrich-Wilhelms-Universität, D-53113 Bonn, Germany.
Drug Discov Today. 2018 Jun;23(6):1183-1186. doi: 10.1016/j.drudis.2018.03.005. Epub 2018 Mar 17.
Public repositories of compounds and activity data are of prime importance for pharmaceutical research in academic and industrial settings. Major databases have evolved over the years. Their growth is accompanied by an increasing tendency toward data sharing. This is a positive development but not without potential problems. Using ChEMBL and PubChem as examples, we show that crosstalk between databases also leads to substantial data redundancy that might not be obvious. Redundancy is an important issue because it biases data analysis and knowledge extraction and leads to inflated views of available compounds, assays, and activity data. Going forward, it will be important to further refine data exchange and deposition criteria and to make redundancy as transparent as possible.