Peter L. Reichertz Institute for Medical Informatics of TU Braunschweig and Hannover Medical School, Carl-Neuberg-Str. 1, 30625, Hannover, Germany.
BMC Med Inform Decis Mak. 2021 Nov 1;21(1):302. doi: 10.1186/s12911-021-01656-x.
Data quality assessment is important but complex and task-dependent. Identifying suitable measurement methods, and reference ranges for assessing their results, is challenging. Both manual inspection of measurement results and current data-driven approaches for learning which results indicate data quality issues have considerable limitations, e.g., in identifying task-dependent thresholds for measurement results that indicate data quality issues.
To explore the applicability and potential benefits of a data-driven approach for learning task-dependent knowledge about suitable measurement methods and the assessment of their results. Such knowledge could help others determine whether a local data stock is suitable for a given task.
We started by creating artificial data with previously defined data quality issues and applied a set of generic measurement methods to this data (e.g., counting the number of values in a certain variable or computing their mean). We trained decision trees on the exported measurement methods' results and the corresponding outcome data (data indicating the data's suitability for a use case). For evaluation, we derived rules for potential measurement methods and reference values from the decision trees and compared these regarding their coverage of the true data quality issues artificially created in the dataset. Three researchers independently derived these rules: one with knowledge of the data quality issues present and two without.
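The core idea can be sketched as follows: label each dataset's measurement results with a suitability outcome, fit a decision tree, and read candidate rules (measurement method plus threshold) off its splits. This is a minimal illustration, not the authors' implementation; the library choice (scikit-learn), the two measurement methods, and all thresholds are assumptions for illustration.

```python
# Minimal sketch: learning data quality rules from labeled measurement
# results with a decision tree. Measurement methods, thresholds, and the
# use of scikit-learn are illustrative assumptions, not the paper's code.
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)
n = 200  # number of simulated data stocks

# Results of two generic measurement methods per data stock:
# a value count (completeness) and the mean value of a variable.
value_count = rng.integers(50, 1000, size=n)
mean_value = rng.normal(70, 15, size=n)

# Artificial ground truth mimicking predefined data quality issues:
# a stock is suitable only if sufficiently complete and the mean is
# within a plausible range for the task.
suitable = (value_count >= 300) & (np.abs(mean_value - 70) <= 20)

# Train a shallow tree on the measurement results and outcome labels.
X = np.column_stack([value_count, mean_value])
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, suitable)

# The tree's split conditions suggest which measurement methods matter
# and which reference values separate suitable from unsuitable data;
# a reviewer can turn these into explicit data quality rules.
print(export_text(tree, feature_names=["value_count", "mean_value"]))
```

Each path from root to a "suitable" leaf reads as a conjunction of candidate rules (e.g., a minimum value count combined with a mean-value range), which is the form of knowledge the evaluation then compares against the artificially created issues.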
Our self-trained decision trees indicated rules for 12 of the 19 previously defined data quality issues. The learned knowledge about measurement methods and their assessment was complementary to manual interpretation of the measurement methods' results.
Our data-driven approach derives sensible knowledge for task-dependent data quality assessment and complements other current approaches. Using labeled measurement methods' results as training data, our approach successfully suggested applicable rules for checking the data quality characteristics that determine whether a dataset is suitable for a given task.