University Duisburg-Essen, Faculty of Medicine, IMIBE, Essen, Germany.
Stud Health Technol Inform. 2023 Jun 29;305:24-27. doi: 10.3233/SHTI230414.
Although data quality is well defined, its relationship to data quantity remains unclear. In particular, the big data approach promises that sheer volume offers advantages over small samples of good quality. The aim of this study was to review this issue. Based on experience with six registries within a German funding initiative, the definition of data quality provided by the International Organization for Standardization (ISO) was compared against several aspects of data quantity. The results of a literature search combining both concepts were also considered. Data quantity was identified as an umbrella term for some inherent characteristics of data, such as case completeness and data completeness. At the same time, quantity could be regarded as a non-inherent characteristic of data that lies beyond the ISO standard and concerns the breadth and depth of metadata, i.e. data elements along with their value sets. The FAIR Guiding Principles take only the latter into account. Surprisingly, the literature agreed in demanding that data quality increase with volume, turning the big data approach inside out. The use of data without context, as can be the case in data mining or machine learning, is covered neither by the concept of data quality nor by that of data quantity.