Huang Ying, Dasgupta Sayan
Vaccine & Infectious Disease Division, Fred Hutchinson Cancer Center, US.
N Engl J Stat Data Sci. 2024 Apr;2(1):3-14. doi: 10.51387/24-nejsds59. Epub 2024 Jan 31.
We consider the problem of developing flexible and parsimonious biomarker combinations for early cancer detection when variables are missing at random. Motivated by the need to develop biomarker panels in a cross-institute pancreatic cyst biomarker validation study, we propose logic regression-based methods for feature selection and construction of logic rules under a multiple imputation framework. We generate ensemble trees for classification decisions, and further select a single decision tree for simplicity and interpretability. We demonstrate superior performance of the proposed methods compared to alternative methods based on complete-case data or single imputation. The methods are applied to the pancreatic cyst data to estimate biomarker panels for pancreatic cyst subtype classification and malignant potential prediction.
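The workflow summarized above (impute several times, fit a logic rule on each imputed dataset, classify by ensemble vote, then keep one rule for interpretability) can be illustrated with a minimal Python sketch. It is not the authors' implementation. Every function name is hypothetical, and each step is a simplified stand-in: random draws from observed values replace a proper imputation model, an exhaustive search over two-marker AND/OR rules replaces the logic regression search, and picking the most frequently selected rule is only one assumed way to choose a single rule.

```python
import itertools
import random
from collections import Counter

# Boolean operators available to a rule (markers are coded 0/1).
OPS = {"and": lambda a, b: a and b, "or": lambda a, b: a or b}


def impute_once(rows, rng):
    """Fill each missing (None) marker with a random draw from the observed
    values of the same column. Assumes every column has an observed value."""
    n_cols = len(rows[0])
    observed = [[r[j] for r in rows if r[j] is not None] for j in range(n_cols)]
    return [
        [v if v is not None else rng.choice(observed[j]) for j, v in enumerate(r)]
        for r in rows
    ]


def apply_rule(rule, row):
    """Evaluate a rule (i, op, j) on one complete row; returns 0 or 1."""
    i, op, j = rule
    return int(OPS[op](row[i], row[j]))


def fit_rule(rows, labels):
    """Return the two-marker AND/OR rule with the highest training accuracy
    (a stand-in for a logic regression fit on one imputed dataset)."""
    best, best_acc = None, -1.0
    pairs = itertools.combinations(range(len(rows[0])), 2)
    for (i, j), op in itertools.product(pairs, OPS):
        rule = (i, op, j)
        acc = sum(apply_rule(rule, r) == y for r, y in zip(rows, labels)) / len(rows)
        if acc > best_acc:
            best, best_acc = rule, acc
    return best


def fit_ensemble(rows, labels, m=5, seed=0):
    """Fit one rule per imputed dataset; m imputations give m rules."""
    rng = random.Random(seed)
    return [fit_rule(impute_once(rows, rng), labels) for _ in range(m)]


def predict(rules, row):
    """Majority vote of the ensemble on one complete row (use odd m)."""
    votes = Counter(apply_rule(rule, row) for rule in rules)
    return votes.most_common(1)[0][0]


def select_single_rule(rules):
    """Assumed selection criterion: the rule chosen most often."""
    return Counter(rules).most_common(1)[0][0]


if __name__ == "__main__":
    # Toy data: label = marker0 AND marker1, marker2 is noise, None = missing.
    rows = [[1, 1, 0], [1, None, 1], [0, 1, 1], [0, 0, None],
            [1, 1, 1], [None, 0, 0], [0, 1, 0], [0, 0, 1]]
    labels = [1, 0, 0, 0, 1, 0, 0, 0]
    rules = fit_ensemble(rows, labels, m=5, seed=0)
    print(select_single_rule(rules), predict(rules, [1, 1, 0]))
```

An odd number of imputations avoids tied votes in this sketch. A realistic analysis would draw imputations from a model conditional on the other markers and the outcome, and would assess the selected rule on held-out data.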