Enhancing statistical power in temporal biomarker discovery through representative shapelet mining.

Author information

Gumbsch Thomas, Bock Christian, Moor Michael, Rieck Bastian, Borgwardt Karsten

Affiliations

Department of Biosystems Science and Engineering, ETH Zurich, Basel 4058, Switzerland.

SIB Swiss Institute of Bioinformatics, Lausanne 1015, Switzerland.

Publication information

Bioinformatics. 2020 Dec 30;36(Suppl_2):i840-i848. doi: 10.1093/bioinformatics/btaa815.

DOI:10.1093/bioinformatics/btaa815

PMID:33381811

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC7773478/

Abstract

MOTIVATION

Temporal biomarker discovery in longitudinal data is based on detecting recurring trajectories, the so-called shapelets. The search for shapelets requires considering all subsequences in the data. While the accompanying issue of multiple testing has been mitigated in previous work, the redundancy and overlap of the detected shapelets result in an a priori unbounded number of highly similar and structurally meaningless shapelets. As a consequence, current temporal biomarker discovery methods are impractical and underpowered.
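
To make the size of this search space concrete, the following minimal sketch enumerates every contiguous subsequence of a time series and checks whether a candidate shapelet occurs in it. It assumes Euclidean distance as the measure of match, which the abstract does not specify. The function names are hypothetical and are not taken from the S3M package.

```python
import math

def subsequences(series, length):
    """Yield every contiguous subsequence of the given length."""
    for start in range(len(series) - length + 1):
        yield series[start:start + length]

def shapelet_distance(shapelet, series):
    """Smallest Euclidean distance between the shapelet and any window
    of the same length (assumed distance; inf if the series is too short)."""
    windows = subsequences(series, len(shapelet))
    return min((math.dist(shapelet, w) for w in windows), default=math.inf)

def count_candidates(dataset, min_len, max_len):
    """Number of candidate subsequences across all series and lengths."""
    return sum(
        max(len(series) - length + 1, 0)
        for series in dataset
        for length in range(min_len, max_len + 1)
    )

dataset = [[0, 1, 2, 5], [1, 2, 3]]
print(shapelet_distance([1, 2], dataset[0]))  # 0.0 means an exact occurrence
print(count_candidates(dataset, 2, 3))
```

Even in this toy setting the candidate count grows with both the series length and the range of admissible shapelet lengths, and neighbouring windows overlap heavily, which is the source of the redundancy described above.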

RESULTS

We find that pre- or post-processing of shapelets does not sufficiently increase power and practical utility. Consequently, we present a novel method for temporal biomarker discovery, Statistically Significant Submodular Subset Shapelet Mining (S5M), which retrieves short subsequences that (i) occur in the data, (ii) are statistically significantly associated with the phenotype, and (iii) are of manageable quantity while maximizing structural diversity. Structural diversity is achieved by pruning non-representative shapelets via submodular optimization. This increases the statistical power and utility of S5M compared to state-of-the-art approaches on simulated and real-world datasets. For patients admitted to the intensive care unit (ICU) showing signs of severe organ failure, we find temporal patterns in the sequential organ failure assessment score that are associated with in-ICU mortality.
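
As an illustration of how submodular optimization can select a small, structurally diverse subset, the sketch below greedily maximizes a facility-location objective over candidate shapelets. This is an assumption-laden example and not the S5M implementation: the abstract does not state which submodular function is used, the exponential similarity kernel is a placeholder, all candidates are assumed to have equal length, and the statistical significance testing step is omitted. All function names are hypothetical.

```python
import math

def similarity(a, b):
    """Similarity in (0, 1]; equals 1 for identical shapes (assumed kernel)."""
    return math.exp(-math.dist(a, b))

def coverage(selected, candidates):
    """Facility-location objective: each candidate is credited with its
    best similarity to any selected representative."""
    if not selected:
        return 0.0
    return sum(
        max(similarity(c, candidates[i]) for i in selected)
        for c in candidates
    )

def greedy_representatives(candidates, k):
    """Pick up to k indices, each time adding the largest marginal gain."""
    selected = []
    remaining = set(range(len(candidates)))
    while remaining and len(selected) < k:
        base = coverage(selected, candidates)
        best = max(
            sorted(remaining),
            key=lambda i: coverage(selected + [i], candidates) - base,
        )
        selected.append(best)
        remaining.remove(best)
    return selected

# Two pairs of near-duplicate shapes; the greedy pass keeps one from each pair.
shapelets = [[0, 1, 2], [0, 1, 2.1], [5, 4, 3], [5, 4, 3.1]]
picked = greedy_representatives(shapelets, 2)
print(picked)
```

Because a near-duplicate of an already selected shapelet adds almost nothing to the objective, the greedy pass prefers structurally different candidates. For monotone submodular objectives under a cardinality constraint, this greedy strategy is known to achieve at least a (1 - 1/e) fraction of the optimal value.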

AVAILABILITY AND IMPLEMENTATION

S5M is available as an option in the S3M Python package: github.com/BorgwardtLab/S3M.
