The Roslin Institute, The University of Edinburgh, Easter Bush Campus, Midlothian, Edinburgh, United Kingdom.
The Royal (Dick) School of Veterinary Studies, The University of Edinburgh, Easter Bush Campus, Midlothian, Edinburgh, United Kingdom.
PLoS One. 2020 Jan 24;15(1):e0228154. doi: 10.1371/journal.pone.0228154. eCollection 2020.
All data are prone to error and require cleaning prior to analysis. An important example is longitudinal growth data, for which there are no universally agreed standard methods for identifying and removing implausible values, and many existing methods have limitations that restrict their use across different domains. We designed a decision-making algorithm that modifies or deletes growth measurements based on a combination of pre-defined cut-offs and logic rules. Five data cleaning methods for growth data were tested with and without the addition of the algorithm and applied to five different longitudinal growth datasets: four uncleaned canine weight or height datasets and one pre-cleaned human weight dataset with randomly simulated errors. Prior to the addition of the algorithm, data cleaning based on non-linear mixed effects models was the most effective method in all datasets, with on average at least 26.00% higher sensitivity and 0.12% higher specificity than the other methods. Data cleaning methods using the algorithm had improved data preservation and were capable of correcting simulated errors according to the gold standard, that is, returning a value to its original state prior to error simulation. The algorithm improved the performance of all data cleaning methods and increased the average sensitivity and specificity of the non-linear mixed effects model method by 7.68% and 0.42%, respectively. Cleaning data with non-linear mixed effects models combined with the algorithm has several advantages: it allows individual growth trajectories to vary from the population by using repeated longitudinal measurements; it identifies consecutive errors and errors in the first data entry; it avoids the requirement for a minimum number of data entries; it preserves data where possible by correcting errors rather than deleting them; and it removes duplications intelligently.
This algorithm is broadly applicable to cleaning anthropometric data in different mammalian species and could be adapted for use in a range of other domains.
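
The sensitivity and specificity figures quoted above compare the measurements a cleaning method flags against the known simulated errors. As a minimal sketch of that evaluation, assuming each measurement carries a boolean "is a simulated error" label and a boolean "flagged by the method" label (the function name and the example data below are hypothetical and are not taken from the paper):

```python
from typing import Sequence, Tuple


def sensitivity_specificity(
    is_error: Sequence[bool], flagged: Sequence[bool]
) -> Tuple[float, float]:
    """Return (sensitivity, specificity) as percentages.

    is_error: True where a measurement is a known (simulated) error.
    flagged:  True where the cleaning method marked the measurement.
    """
    if len(is_error) != len(flagged):
        raise ValueError("is_error and flagged must have the same length")

    pairs = list(zip(is_error, flagged))
    true_pos = sum(1 for e, f in pairs if e and f)
    false_neg = sum(1 for e, f in pairs if e and not f)
    true_neg = sum(1 for e, f in pairs if not e and not f)
    false_pos = sum(1 for e, f in pairs if not e and f)

    if true_pos + false_neg == 0 or true_neg + false_pos == 0:
        raise ValueError("need at least one error and one non-error")

    # Sensitivity: share of true errors that the method flagged.
    sensitivity = 100.0 * true_pos / (true_pos + false_neg)
    # Specificity: share of correct values that the method left alone.
    specificity = 100.0 * true_neg / (true_neg + false_pos)
    return sensitivity, specificity


# Hypothetical example: 8 measurements, 2 simulated errors.
truth = [False, True, False, False, True, False, False, False]
flags = [False, True, False, True, False, False, False, False]
sens, spec = sensitivity_specificity(truth, flags)
print(f"sensitivity={sens:.2f}%  specificity={spec:.2f}%")
```

Because true errors are usually a small fraction of a growth dataset, small changes in specificity can correspond to many preserved measurements, which is one reason the two metrics are reported together.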