Banerjee Anjishnu, DeVogel Nicholas, Pintar Frank A, Yoganandan Narayan
a Division of Biostatistics, Medical College of Wisconsin, Milwaukee, Wisconsin.
b Department of Neurosurgery, Medical College of Wisconsin, Milwaukee, Wisconsin.
Traffic Inj Prev. 2018;19(sup2):S121-S126. doi: 10.1080/15389588.2018.1519805. Epub 2018 Dec 20.
The objective of this study is to present a novel framework, termed the knockoff technique, for evaluating different metric ranking algorithms so that human response to injury can be better described.
Many biomechanical metrics are routinely obtained from impact tests using postmortem human surrogates (PMHS) to develop injury risk curves (IRCs). The IRCs form the basis for evaluating human safety in crashworthiness environments. The biomechanical metrics should be chosen based on some measure of their predictive ability. Commonly used algorithms for ranking the metrics include (a) the area under the receiver operating characteristic curve (AUROC), time-varying AUROC, and other adaptations, and (b) variants of predictive squared error loss. This article develops a rigorous framework to evaluate such metric selection/ranking algorithms. Actual experimental data are used because of the shortcomings of simulated data. The knockoff data are meshed into the existing experimental data using advanced statistical algorithms, and error rate measures such as false discovery rates (FDRs) and bias are then calculated with the knockoff technique. The experimental data come from previously published whole-body PMHS side impact sled tests conducted at different velocities, with padded and rigid load walls, with and without offsets, and with different supplemental restraint systems. Each PMHS was subjected to a single lateral impact loading, resulting in injury or noninjury outcomes.
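The abstract does not specify how the knockoffs are constructed; the Python sketch below illustrates the general idea under strong simplifying assumptions. A simple per-column permutation stands in for the advanced statistical algorithms that mesh the knockoff data into the experimental data, and all names (X, y, make_knockoffs, knockoff_fdr) are illustrative rather than taken from the paper.

import numpy as np

rng = np.random.default_rng(0)

def make_knockoffs(X, rng):
    # Create one knockoff copy of each biomechanical metric by permuting
    # its values across specimens: the permutation breaks any association
    # with the injury outcome while preserving the marginal distribution,
    # so each knockoff column is a known null.
    return np.column_stack([rng.permutation(X[:, j]) for j in range(X.shape[1])])

def knockoff_fdr(ranked_idx, n_real, k):
    # Estimate the FDR of a ranking algorithm as the fraction of knockoff
    # (known-null) columns among its top-k ranked metrics; columns with
    # index >= n_real are knockoffs.
    return float(np.mean(np.asarray(ranked_idx[:k]) >= n_real))

# Illustrative sizes matching the abstract: 42 tests, 25 candidate metrics.
X = rng.normal(size=(42, 25))                     # placeholder metric values
y = rng.integers(0, 2, size=42)                   # binary injury/noninjury outcome
X_aug = np.hstack([X, make_knockoffs(X, rng)])    # real metrics + knockoffs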
A total of 25 metrics were used from 42 tests. The AUROC-type algorithms tended to have higher FDRs than the squared error loss-type functions (45.3% for the best AUROC-type algorithm versus 31.4% for the best Brier score algorithm). Standard errors for the Brier score algorithm also tended to be lower, indicating more stable metric choices and more robust rankings. The wide variation observed in the performance of the algorithms demonstrates the need for data set-specific evaluation tools such as the knockoff technique developed in this study.
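As a rough illustration of the comparison above, the continuation below (reusing np, X_aug, y, and knockoff_fdr from the earlier sketch) ranks the augmented columns once by AUROC and once by the Brier score of a single-variable logistic fit, then estimates each algorithm's FDR as the share of knockoffs among its top-ranked metrics. sklearn's roc_auc_score and brier_score_loss are generic stand-ins, not the paper's specific algorithm variants.

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, brier_score_loss

def rank_by_auroc(X_aug, y):
    # Score each column by AUROC, folding in its complement so that metrics
    # predictive in either direction rank highly.
    scores = [max(roc_auc_score(y, X_aug[:, j]), 1.0 - roc_auc_score(y, X_aug[:, j]))
              for j in range(X_aug.shape[1])]
    return np.argsort(scores)[::-1]          # descending: best AUROC first

def rank_by_brier(X_aug, y):
    # Score each column by the Brier (squared error) loss of a one-variable
    # logistic regression; lower loss indicates better predictive ability.
    scores = []
    for j in range(X_aug.shape[1]):
        model = LogisticRegression().fit(X_aug[:, [j]], y)
        p = model.predict_proba(X_aug[:, [j]])[:, 1]   # predicted injury probability
        scores.append(brier_score_loss(y, p))
    return np.argsort(scores)                # ascending: lowest loss first

k, n_real = 10, 25                           # select the top 10 of the 25 real metrics
print("AUROC-type FDR:", knockoff_fdr(rank_by_auroc(X_aug, y), n_real, k))
print("Brier-type FDR:", knockoff_fdr(rank_by_brier(X_aug, y), n_real, k))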
In the present data set, the AUROC and related binary classification algorithms led to inflated FDRs, rendering metric selection/ranking questionable. This is particularly true for data sets with a high proportion of censoring. Squared error loss-type algorithms (such as the Brier score algorithm or its modifications) improved the performance of the metric selection process. The new knockoff technique presented here may wholly change how IRCs are developed from impact experiments or simulations. At the very least, it demonstrates the need to evaluate different metric ranking/selection algorithms against one another, especially when they produce substantially different biomechanical metric choices. Without recommending AUROC-type or Brier score-type algorithms universally, the authors suggest careful assessment of these algorithms using the proposed framework, so that an algorithm robust to the nature of the experimental data set can be chosen. Though results are given for data sets from a published series of experiments, the authors are applying the knockoff technique to tests relevant to the automotive, aviation, military, and other environments.