Stanford Center for Biomedical Informatics Research, Stanford University, Stanford CA, USA.
Sci Data. 2019 Feb 19;6:190021. doi: 10.1038/sdata.2019.21.
We present an analytical study of the quality of metadata about samples used in biomedical experiments. The metadata under analysis are stored in two well-known databases: BioSample, a repository managed by the National Center for Biotechnology Information (NCBI), and BioSamples, a repository managed by the European Bioinformatics Institute (EBI). We tested whether 11.4 million sample metadata records in the two repositories are populated with values that fulfill the stated requirements for such values. Our study revealed multiple anomalies in the metadata. Most metadata field names and their values are not standardized or controlled. Even simple binary or numeric fields are often populated with inadequate values of different data types. By clustering metadata field names, we discovered that there are often many distinct ways to represent the same aspect of a sample. Overall, the metadata we analyzed reveal a lack of principled mechanisms to enforce and validate metadata requirements. The significant anomalies that we found in the metadata are likely to impede search and secondary use of the associated datasets.
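The two kinds of checks described above (whether a value matches the type its field requires, and whether several field names denote the same aspect of a sample) can be illustrated with a minimal Python sketch. It is not the method used in the study: the accepted Boolean vocabulary, the function names, and the sample field names are all assumptions made for illustration, and the grouping step uses simple name normalization as a stand-in for whatever clustering technique the authors applied.

```python
import math
import re
from collections import defaultdict

# Assumed set of acceptable Boolean spellings; a real repository
# specification may define a different vocabulary.
BOOLEAN_VALUES = {"true", "false", "yes", "no", "0", "1"}


def is_valid(value, expected_type):
    """Return True if the raw string value fits the expected field type."""
    text = value.strip().lower()
    if expected_type == "boolean":
        return text in BOOLEAN_VALUES
    if expected_type == "numeric":
        try:
            number = float(text)
        except ValueError:
            return False
        # Reject "nan" and "inf", which float() parses but which are
        # not meaningful measurements.
        return math.isfinite(number)
    raise ValueError(f"unsupported field type: {expected_type}")


def normalize_field_name(name):
    """Lowercase a field name and drop every non-alphanumeric character."""
    return re.sub(r"[^a-z0-9]", "", name.lower())


def group_field_names(names):
    """Group field names whose normalized forms coincide.

    Only groups with more than one distinct spelling are returned.
    """
    groups = defaultdict(set)
    for name in names:
        groups[normalize_field_name(name)].add(name)
    return {key: variants for key, variants in groups.items() if len(variants) > 1}


if __name__ == "__main__":
    # Hypothetical field names, not taken from either repository.
    print(is_valid("not collected", "numeric"))
    print(group_field_names(["Age", "age", "collection_date", "Collection Date", "sex"]))
```

Normalization only catches variants that differ in case, spacing, or punctuation. Synonymous names with different spellings (for example, hypothetical fields such as `gender` and `sex`) would need a string-distance or ontology-based approach.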