Novadiscovery, 1 Place Giovanni Da Verrazzano, 69009, Lyon, France.
BMC Bioinformatics. 2023 Sep 4;24(1):331. doi: 10.1186/s12859-023-05430-w.
Over the past several decades, metrics have been defined to assess the quality of various types of models and to compare their performance based on their capacity to explain the variance found in real-life data. However, available validation methods are mostly designed for statistical regressions rather than for mechanistic models. To our knowledge, there are no consensus standards for the latter, for instance for validating predictions against real-world data given the variability and uncertainty of those data. In this work, we focus on the prediction of time-to-event curves, using a mechanistic model of non-small cell lung cancer as an application example. We designed four empirical methods to assess both model performance and the reliability of predictions: two methods based on bootstrapped versions of parametric statistical tests, namely the log-rank test and combined weighted log-rank tests (MaxCombo), and two methods based on bootstrapped prediction intervals, referred to here as raw coverage and the juncture metric. We also introduced the notion of observation time uncertainty to account for the real-life delay between the moment an event happens and the moment it is observed and reported.
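As an illustration of the prediction-interval idea described above, the following minimal Python sketch bootstraps Kaplan-Meier curves from simulated patients and measures how often an observed curve falls inside the resulting band. It is a sketch under stated assumptions, not the authors' implementation: the function names (`km_survival`, `bootstrap_band`, `raw_coverage`), the pointwise percentile band, and the definition of coverage as the fraction of grid times inside the band are all hypothetical choices made here, and the data in `demo` are synthetic.

```python
import numpy as np


def km_survival(times, events, grid):
    """Kaplan-Meier survival estimate evaluated on a time grid."""
    times = np.asarray(times, dtype=float)
    events = np.asarray(events, dtype=bool)
    event_times = np.unique(times[events])
    steps = np.ones(event_times.size)
    s = 1.0
    for i, t in enumerate(event_times):
        at_risk = np.sum(times >= t)              # subjects still followed at t
        n_events = np.sum((times == t) & events)  # events occurring exactly at t
        s *= 1.0 - n_events / at_risk
        steps[i] = s
    # Right-continuous step function: value after the last event time <= g.
    idx = np.searchsorted(event_times, np.asarray(grid, dtype=float), side="right")
    return np.concatenate(([1.0], steps))[idx]


def bootstrap_band(sim_times, sim_events, grid, n_obs, n_boot=200, alpha=0.05, seed=0):
    """Pointwise percentile band from bootstrapped simulated cohorts.

    Assumption: each bootstrap cohort has the same size as the observed
    cohort (n_obs), so the band reflects sampling variability at that size.
    """
    rng = np.random.default_rng(seed)
    sim_times = np.asarray(sim_times, dtype=float)
    sim_events = np.asarray(sim_events, dtype=bool)
    grid = np.asarray(grid, dtype=float)
    curves = np.empty((n_boot, grid.size))
    for b in range(n_boot):
        idx = rng.integers(0, sim_times.size, size=n_obs)  # resample with replacement
        curves[b] = km_survival(sim_times[idx], sim_events[idx], grid)
    lower = np.quantile(curves, alpha / 2, axis=0)
    upper = np.quantile(curves, 1 - alpha / 2, axis=0)
    return lower, upper


def raw_coverage(observed_curve, lower, upper):
    """Fraction of grid times at which the observed curve lies inside the band
    (an assumed definition, used here for illustration only)."""
    observed_curve = np.asarray(observed_curve, dtype=float)
    inside = (observed_curve >= lower) & (observed_curve <= upper)
    return float(inside.mean())


def demo():
    """Synthetic example: exponential event times, censoring at 24 months."""
    rng = np.random.default_rng(42)
    horizon = 24.0
    grid = np.linspace(0.0, horizon, 49)

    raw_sim = rng.exponential(scale=12.0, size=2000)  # stand-in for model output
    raw_obs = rng.exponential(scale=12.0, size=100)   # stand-in for trial data
    sim_t, sim_e = np.minimum(raw_sim, horizon), raw_sim <= horizon
    obs_t, obs_e = np.minimum(raw_obs, horizon), raw_obs <= horizon

    lower, upper = bootstrap_band(sim_t, sim_e, grid, n_obs=obs_t.size)
    observed = km_survival(obs_t, obs_e, grid)
    print(f"raw coverage: {raw_coverage(observed, lower, upper):.2f}")


if __name__ == "__main__":
    demo()
```

A pointwise band is the simplest choice here. A simultaneous band, or a metric that weights where along the curve the observed data leave the band, would behave differently, which is one reason a single metric may not suffice.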
We highlight the advantages and disadvantages of these methods according to their application context, and show that the context of use of the model has an impact on the model validation process. Using several validation metrics revealed the limits of the model in predicting the evolution of the disease across the whole population of mutations at once, and showed that it performed better with specific predictions in the target mutation populations. Choosing and using a single metric could have led to an erroneous validation of the model and its context of use.
With this work, we stress the importance of choosing a metric judiciously, and show how a combination of metrics could be more relevant when the objective is to validate a given model and its predictions within a specific context of use. We also show how the reliability of the results depends both on the metric and on the statistical comparisons, and that the conditions of application and the type of available information need to be taken into account to choose the best validation strategy.