Constantino Cláudia S, Carvalho Alexandra M, Vinga Susana
INESC-ID, Instituto Superior Técnico, ULisboa, R. Alves Redol 9, Lisbon, 1000-029, Portugal.
Instituto de Telecomunicações, Instituto Superior Técnico, ULisboa, Av. Rovisco Pais 1, Lisbon, 1049-001, Portugal.
BioData Min. 2021 Apr 14;14(1):25. doi: 10.1186/s13040-021-00257-8.
Longitudinal gene expression analysis and survival modeling have been shown to add valuable biological and clinical knowledge. This study proposes a novel framework to discover gene signatures and patterns in high-dimensional time-series transcriptomics data and to assess their association with hospital length of stay.
We investigated a longitudinal, high-dimensional gene expression dataset from 168 blunt-force trauma patients followed during the first 28 days after injury. To model the length of stay, we first reduced dimensionality by applying Cox regression with elastic net regularization to gene expression data from the first days of hospitalization. We also propose a novel methodology to impute missing values for the genes selected in that step. We then applied multivariate time series (MTS) clustering to analyse gene expression over time and to stratify patients with similar trajectories. The patient partitions obtained by MTS clustering were validated using Kaplan-Meier curves and log-rank tests.
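The validation step can be illustrated with a minimal sketch of a Kaplan-Meier estimator and a two-group log-rank test. This is not the authors' implementation: the function names `kaplan_meier` and `logrank_test` and the toy data are assumptions made for illustration, and the "event" is taken to be hospital discharge, with patients still hospitalized at the end of follow-up treated as censored.

```python
import math
from collections import Counter


def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct event time.

    times  : follow-up time per patient (e.g. days until discharge)
    events : 1 if the event (discharge) was observed, 0 if censored
    """
    n_events = Counter(t for t, e in zip(times, events) if e)
    surv, curve = 1.0, []
    for t in sorted(n_events):
        at_risk = sum(1 for u in times if u >= t)  # still in hospital just before t
        surv *= 1.0 - n_events[t] / at_risk
        curve.append((t, surv))
    return curve


def logrank_test(times_a, events_a, times_b, events_b):
    """Two-group log-rank test; returns (chi2, p_value) with 1 degree of freedom."""
    all_times = list(times_a) + list(times_b)
    all_events = list(events_a) + list(events_b)
    event_times = sorted({t for t, e in zip(all_times, all_events) if e})
    obs_minus_exp, var = 0.0, 0.0
    for t in event_times:
        n_a = sum(1 for u in times_a if u >= t)
        n_b = sum(1 for u in times_b if u >= t)
        d_a = sum(1 for u, e in zip(times_a, events_a) if e and u == t)
        d_b = sum(1 for u, e in zip(times_b, events_b) if e and u == t)
        n, d = n_a + n_b, d_a + d_b
        obs_minus_exp += d_a - d * n_a / n  # observed minus expected events in group A
        if n > 1:
            var += d * (n_a / n) * (n_b / n) * (n - d) / (n - 1)
    chi2 = obs_minus_exp ** 2 / var if var > 0 else 0.0
    # Survival function of a chi-square variable with 1 df
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Hypothetical toy data: discharge days for two patient clusters
cluster_1 = [1, 2, 3]
cluster_2 = [10, 11, 12]
chi2, p = logrank_test(cluster_1, [1, 1, 1], cluster_2, [1, 1, 1])
print(f"chi2 = {chi2:.2f}, p = {p:.3f}")
```

In this setting a small p-value would suggest that the clusters differ in their time-to-discharge curves, which is the kind of evidence the validation step looks for.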
We identified 22 genes strongly associated with hospital discharge. Their expression values in the first days after trauma proved to be good predictors of length of stay. The proposed mixed imputation method yielded a complete dataset of short time series with minimal loss of information over the 28 days of follow-up. MTS clustering grouped patients with similar gene trajectories and, notably, similar hospital discharge days. Patients within each cluster have comparable gene trajectories and may respond analogously to injury.
The proposed framework was able to tackle the joint analysis of time-to-event information and longitudinal, multivariate, high-dimensional data. Its application to length of stay and transcriptomics data revealed a strong relationship between gene expression trajectories and patient recovery, which may improve the management of trauma patients by healthcare systems. The proposed methodology can be easily adapted to other medical data, supporting more effective clinical decision support systems for health applications.