Akhtar-Danesh Noori, Dehghan-Kooshkghazi Mahshid
School of Nursing & Department of Clinical Epidemiology and Biostatistics, McMaster University, Hamilton, Canada.
BMC Med Res Methodol. 2003 Sep 29;3:18. doi: 10.1186/1471-2288-3-18.
Misconduct in medical research has been the subject of many papers in recent years. Among the different types of misconduct, data fabrication might be considered one of the most severe. It has been argued that correlation coefficients in fabricated data-sets are usually greater than those found in real data-sets. We aim to study the differences between real and fabricated data-sets in terms of the association between two variables.
Three examples are presented in which outcomes from made-up (fabricated) data-sets are compared with the results from three real data-sets and with appropriate simulated data-sets. The data-sets were made up by faculty members in three universities. The first two examples are devoted to the correlation structure between continuous variables in two different settings: first, when there is a high correlation coefficient between the variables; second, when the variables are not correlated. In the third example, the differences between the real data-set and the fabricated data-sets are studied using the independent t-test for comparison between two means.
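The two statistics named above, the correlation coefficient and the independent t-test, can be sketched in a few lines. The following is a minimal illustration rather than the authors' code: the function names (`pearson_r`, `pooled_t`, `simulate_pair`), the sample size, and the target correlation are all assumptions chosen for the example, and the t statistic shown is the classical equal-variance (pooled) form.

```python
import math
import random


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def pooled_t(a, b):
    """Independent two-sample t statistic, assuming equal variances."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((v - ma) ** 2 for v in a) / (na - 1)
    vb = sum((v - mb) ** 2 for v in b) / (nb - 1)
    sp2 = ((na - 1) * va + (nb - 1) * vb) / (na + nb - 2)
    return (ma - mb) / math.sqrt(sp2 * (1 / na + 1 / nb))


def simulate_pair(n, rho, rng):
    """Draw n pairs of standard normal values with population correlation rho.

    Hypothetical helper: a simple way to build a simulated reference data-set.
    """
    xs, ys = [], []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        xs.append(z1)
        ys.append(rho * z1 + math.sqrt(1 - rho ** 2) * z2)
    return xs, ys


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the run is reproducible
    x, y = simulate_pair(200, 0.0, rng)  # illustrative: uncorrelated setting
    print("simulated r:", round(pearson_r(x, y), 3))
    print("t statistic:", round(pooled_t(x, y), 3))
```

Comparing the `pearson_r` value of a submitted data-set with the values obtained from repeated calls to `simulate_pair` at the known population correlation gives a rough sense of whether the submitted value is unusually high.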
In general, higher correlation coefficients are seen in the made-up data-sets than in the real data-sets. This occurs even when the participants are aware that the correlation coefficient for the corresponding real data-set is zero. The findings from the third example, a comparison between the means of two groups, show that many people tend to make up data with smaller or no differences between groups, even when they know how and to what extent the groups differ.
This study indicates that high correlation coefficients can be considered a leading sign of data fabrication, as more than 40% of the participants generated variables with correlation coefficients greater than 0.70. However, when inspecting differences between group means, the same rule may not apply, as we observed smaller differences between groups in the made-up data-sets than in the real data-set. We also showed that inspecting the scatter-plot of two variables can be a useful tool for uncovering fabricated data.