Heinze Georg, Baillie Mark, Lusa Lara, Sauerbrei Willi, Schmidt Carsten Oliver, Harrell Frank E, Huebner Marianne
Center for Medical Data Science, Institute of Clinical Biometrics, Medical University of Vienna, Spitalgasse 23, 1090, Vienna, Austria.
Novartis Pharma AG, Basel, Switzerland.
BMC Med Res Methodol. 2024 Aug 8;24(1):178. doi: 10.1186/s12874-024-02294-3.
Statistical regression models are used to predict outcomes from the values of predictor variables or to describe the association of an outcome with predictors. With a data set at hand, a regression model can easily be fitted with standard software packages. This carries the risk that data analysts rush into sophisticated analyses without sufficient knowledge of the basic properties of their data, the associations within them, and their errors, which can lead to wrong interpretation of the modeling results and to a presentation that lacks clarity. Ignorance of special features of the data, such as redundancies or particular distributions, may even invalidate the chosen analysis strategy. Initial data analysis (IDA) is a prerequisite to regression analysis because it provides the knowledge about the data that is needed to confirm or refine the chosen model-building strategy, to interpret the modeling results correctly, and to guide their presentation. To facilitate reproducibility, IDA needs to be preplanned, the IDA plan should be included in the general statistical analysis plan of a research project, and the results should be well documented. A key principle of IDA is that it abstains from evaluating associations between the outcome and the predictors, which minimizes bias in the statistical inference from the final regression model. We give advice on which aspects to consider in an IDA plan for data screening in the context of regression modeling, as a supplement to the statistical analysis plan. We illustrate this IDA plan with an example of a typical diagnostic modeling project and give recommendations for data visualizations.
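The key principle stated in the abstract, that IDA screens the data without evaluating outcome-predictor associations, can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the function name `screen_predictors`, the three checks (missing-value counts, univariate summaries, predictor-predictor correlations), and the variable names `age`, `crp`, `wbc`, and `disease`. The sketch uses only the Python standard library.

```python
from itertools import combinations
from math import sqrt
from statistics import mean, median


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return None  # constant variable: correlation is undefined
    return sxy / sqrt(sxx * syy)


def screen_predictors(rows, outcome):
    """Hypothetical screening helper: the outcome column is never read."""
    # Collect all column names, then drop the outcome before any computation.
    predictors = sorted({key for row in rows for key in row} - {outcome})
    report = {"missing": {}, "univariate": {}, "correlation": {}}

    for p in predictors:
        values = [row.get(p) for row in rows]
        observed = [v for v in values if v is not None]
        report["missing"][p] = len(values) - len(observed)
        if observed:
            report["univariate"][p] = {
                "min": min(observed),
                "median": median(observed),
                "max": max(observed),
            }

    # Associations among predictors only (pairwise complete observations).
    for a, b in combinations(predictors, 2):
        pairs = [(row[a], row[b]) for row in rows
                 if row.get(a) is not None and row.get(b) is not None]
        if len(pairs) >= 2:
            xs, ys = zip(*pairs)
            report["correlation"][(a, b)] = pearson(xs, ys)
    return report


# Invented toy data; None marks a missing value.
rows = [
    {"age": 50, "crp": 10.0, "wbc": 5.0, "disease": 1},
    {"age": 60, "crp": None, "wbc": 6.0, "disease": 0},
    {"age": 70, "crp": 30.0, "wbc": 7.0, "disease": 1},
    {"age": 80, "crp": 40.0, "wbc": 8.0, "disease": 0},
]
report = screen_predictors(rows, outcome="disease")
```

Removing the outcome before any summary is computed is one simple way to make the principle hard to violate by accident; a real IDA plan would specify its own checks and visualizations in advance.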