Sanders Catherine M, Saltzstein Sidney L, Schultzel Matthew M, Nguyen Duy H, Stafford Helen Shi, Sadler Georgia Robins
Rebecca and John Moores UCSD Cancer Center, University of California San Diego, La Jolla, CA 92093-0850, USA.
J Cancer Educ. 2012 Dec;27(4):664-9. doi: 10.1007/s13187-012-0383-7.
Many health professionals use large datasets to answer behavioral, translational, or clinical questions. Understanding the impact of missing data in large databases, such as disease registries, can help prevent erroneous interpretations of these data. Using the California Cancer Registry, the authors selected seven common cancers, seven sociodemographic and clinical variables, and the top three reporting sources as examples of the type of data that would be deemed critical to most studies. Gender had no missing data; next, in ascending order of missing data, were age (<0.1%), ethnicity (1.7%), stage (9.8%), differentiation (39.1%), and birthplace (41.1%). Reports from hospitals and clinics had the lowest percentages of missing data. Users of large datasets should anticipate the limitations imposed by missing data to prevent methodological flaws and misinterpretations of research findings. Knowing which data may be missing from large datasets, and how much, can help prevent errors in research conclusions and better guide treatment modalities and public health policies and programs.
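As an illustration of the kind of check the abstract recommends, the following is a minimal Python sketch that tallies the percentage of missing values per variable before any analysis. It is not the authors' method. The function name `percent_missing`, the record layout, and the set of codes treated as missing (`None`, empty string, "unknown") are all assumptions made for the example, and they do not reflect the California Cancer Registry's actual coding scheme.

```python
from typing import Any, Dict, Iterable, List, Mapping

# Assumption: these text codes mark a missing value. Real registries use
# their own coding schemes, so adjust this set for the dataset at hand.
MISSING_CODES = {"", "unknown"}


def is_missing(value: Any) -> bool:
    """Return True if a field is absent, None, or carries a missing code."""
    if value is None:
        return True
    if isinstance(value, str):
        return value.strip().lower() in MISSING_CODES
    return False


def percent_missing(
    records: Iterable[Mapping[str, Any]], variables: List[str]
) -> Dict[str, float]:
    """Compute the percentage of records missing each variable."""
    rows = list(records)
    if not rows:
        # No records: report 0.0 for every variable instead of dividing by zero.
        return {var: 0.0 for var in variables}
    result = {}
    for var in variables:
        # A key absent from the record counts as missing (get() returns None).
        missing = sum(1 for row in rows if is_missing(row.get(var)))
        result[var] = 100.0 * missing / len(rows)
    return result


# Hypothetical records, invented purely for illustration.
example_records = [
    {"gender": "F", "stage": "II", "birthplace": None},
    {"gender": "M", "stage": "unknown", "birthplace": ""},
    {"gender": "F", "stage": "I", "birthplace": "CA"},
    {"gender": "M", "stage": "III"},
]

summary = percent_missing(example_records, ["gender", "stage", "birthplace"])
for variable, pct in summary.items():
    print(f"{variable}: {pct:.1f}% missing")
```

Running the same tally separately for each reporting source would show whether the amount of missing data differs by source, as the abstract reports for hospitals and clinics.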