Gumbsch Thomas, Bock Christian, Moor Michael, Rieck Bastian, Borgwardt Karsten
Department of Biosystems Science and Engineering, ETH Zurich, Basel 4058, Switzerland.
SIB Swiss Institute of Bioinformatics, Lausanne 1015, Switzerland.
Bioinformatics. 2020 Dec 30;36(Suppl_2):i840-i848. doi: 10.1093/bioinformatics/btaa815.
Temporal biomarker discovery in longitudinal data is based on detecting recurring trajectories, the so-called shapelets. The search for shapelets requires considering all subsequences in the data. While the accompanying issue of multiple testing has been mitigated in previous work, the redundancy and overlap of the detected shapelets result in an a priori unbounded number of highly similar and structurally meaningless shapelets. As a consequence, current temporal biomarker discovery methods are impractical and underpowered.
We find that pre- or post-processing of shapelets does not sufficiently increase power and practical utility. Consequently, we present a novel method for temporal biomarker discovery, Statistically Significant Submodular Subset Shapelet Mining (S5M), which retrieves short subsequences that (i) occur in the data, (ii) are statistically significantly associated with the phenotype and (iii) are of manageable quantity while maximizing structural diversity. Structural diversity is achieved by pruning non-representative shapelets via submodular optimization. This increases the statistical power and utility of S5M compared to state-of-the-art approaches on simulated and real-world datasets. For patients admitted to the intensive care unit (ICU) showing signs of severe organ failure, we find temporal patterns in the sequential organ failure assessment score that are associated with in-ICU mortality.
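To illustrate what pruning non-representative shapelets via submodular optimization can look like, the following is a minimal sketch and not the S5M implementation. The abstract does not state the objective or the distance measure, so the facility-location objective, the Euclidean distance, the similarity transform, and all names here (`euclidean`, `greedy_representatives`, `candidates`) are assumptions chosen for illustration. The greedy rule itself is the standard approach for maximizing a monotone submodular function under a cardinality constraint.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length subsequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def greedy_representatives(candidates, k):
    """Pick up to k representatives by greedy facility-location maximization.

    Assumed objective: f(S) = sum_i max_{j in S} sim(i, j), which is monotone
    submodular for non-negative similarities.
    """
    n = len(candidates)
    # Similarity in (0, 1]: identical shapes score 1, distant shapes approach 0.
    sim = [
        [1.0 / (1.0 + euclidean(candidates[i], candidates[j])) for j in range(n)]
        for i in range(n)
    ]
    selected = []
    coverage = [0.0] * n  # best similarity of each candidate to the selected set
    for _ in range(min(k, n)):
        best_gain, best_j = 0.0, None
        for j in range(n):
            if j in selected:
                continue
            # Marginal gain of adding j: how much it improves every candidate's coverage.
            gain = sum(max(sim[i][j] - coverage[i], 0.0) for i in range(n))
            if gain > best_gain:
                best_gain, best_j = gain, j
        if best_j is None:
            break  # no remaining candidate improves the objective
        selected.append(best_j)
        coverage = [max(coverage[i], sim[i][best_j]) for i in range(n)]
    return selected


# Hypothetical candidates: three near-duplicate rising patterns, two falling ones.
candidates = [
    [0.0, 1.0, 2.0],
    [0.1, 1.0, 2.0],
    [0.0, 1.1, 2.0],
    [2.0, 1.0, 0.0],
    [2.0, 0.9, 0.0],
]
chosen = greedy_representatives(candidates, k=2)
```

With these toy inputs the selection keeps one rising and one falling pattern and drops the near-duplicates, which is the kind of structural diversity the abstract describes. In a real pipeline the candidates would be the shapelets that already passed the significance test.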
S5M is available as an option in the Python package of S3M: github.com/BorgwardtLab/S3M.