An efficient method for mining cross-timepoint gene regulation sequential patterns from time course gene expression datasets.

Publication information

BMC Bioinformatics. 2013;14(Suppl 12):S3. doi: 10.1186/1471-2105-14-S12-S3. Epub 2013 Sep 24.

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC3848764/

Abstract

BACKGROUND

Time course experiments with repeated samples are increasingly used to observe gene expression changes that imply gene regulation. However, no effective method exists for handling this kind of data. For instance, in a clinical or biological progression such as an inflammatory response or cancer formation, a large-scale microarray approach can identify a great number of differentially expressed genes at different time points. For each repeated experiment with different samples, converting the microarray datasets into transactional databases that hold the significant singleton genes at each time point would allow sequential patterns implying gene regulation to be identified. Traditional sequential pattern mining methods have been widely and successfully applied to other problems, such as mining customer purchasing sequences from a transactional database. To our knowledge, however, these methods are not suitable for such biological datasets, because every transaction in the converted database may contain too many items (genes).
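The conversion step described above can be illustrated with a minimal sketch. It is not the authors' preprocessing: the input layout (log2 fold changes per gene and time point), the threshold of 1.0, the "+"/"-" tags, and the gene values are all assumptions made for illustration.

```python
from typing import Dict, List, Set

# Hypothetical input: log2 fold changes versus baseline for one sample,
# keyed by gene, with one value per time point.
Sample = Dict[str, List[float]]


def to_sequence(sample: Sample, threshold: float = 1.0) -> List[Set[str]]:
    """Turn one sample into a sequence of itemsets, one per time point.

    A gene is kept at a time point when |log2 fold change| >= threshold
    (an assumed significance rule), tagged "+" if up-regulated and "-"
    if down-regulated.
    """
    if not sample:
        return []
    n_timepoints = len(next(iter(sample.values())))
    sequence: List[Set[str]] = [set() for _ in range(n_timepoints)]
    for gene, changes in sample.items():
        for t, change in enumerate(changes):
            if abs(change) >= threshold:
                sequence[t].add(gene + ("+" if change > 0 else "-"))
    return sequence


# Made-up values for three genes over three time points.
sample_a = {
    "IL6": [0.2, 1.8, 2.1],
    "TNF": [1.5, 0.3, -1.2],
    "ACTB": [0.0, 0.1, -0.1],
}
sequence_a = to_sequence(sample_a)
```

Each sample becomes one sequence in the transactional database, and each time point becomes one transaction. With thousands of genes on a real array, a single transaction can hold a very large itemset, which is the difficulty the background describes.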

RESULTS

In this paper, we propose a new algorithm called CTGR-Span (Cross-Timepoint Gene Regulation Sequential pattern) to efficiently mine CTGR-SPs (Cross-Timepoint Gene Regulation Sequential Patterns), even on larger datasets where traditional algorithms are infeasible. CTGR-Span includes several biologically motivated parameters based on the characteristics of gene regulation. We tune these parameters using a GO enrichment analysis so that the resulting CTGR-SPs are more biologically meaningful. The proposed method was evaluated on two publicly available human time course microarray datasets, where it outperformed the traditional methods in execution efficiency. When checked against previous literature, the resulting patterns also correlated strongly with the experimental backgrounds of the datasets used in this study.
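To make the notion of a cross-timepoint sequential pattern concrete, the sketch below computes the support of one candidate pattern over a small sequence database. It is a naive containment check, not the CTGR-Span algorithm, and it omits the biologically designed parameters mentioned above. The function names and the toy database are hypothetical.

```python
from typing import List, Sequence, Set


def contains(sequence: Sequence[Set[str]], pattern: Sequence[Set[str]]) -> bool:
    """Return True if the pattern's itemsets occur in order in the sequence.

    Each pattern itemset must be a subset of the itemset at some time
    point, and matched time points must be strictly increasing.
    """
    t = 0
    for itemset in pattern:
        # Greedily advance to the earliest time point that holds this itemset.
        while t < len(sequence) and not itemset <= sequence[t]:
            t += 1
        if t == len(sequence):
            return False
        t += 1  # the next itemset must match at a later time point
    return True


def support(database: Sequence[Sequence[Set[str]]], pattern: Sequence[Set[str]]) -> float:
    """Fraction of sequences (samples) in the database that contain the pattern."""
    if not database:
        return 0.0
    return sum(contains(s, pattern) for s in database) / len(database)


# Toy database: three samples, three time points each (made-up values).
database: List[List[Set[str]]] = [
    [{"TNF+"}, {"IL6+"}, {"IL6+", "TNF-"}],
    [{"TNF+", "IL1B+"}, {"IL6+"}, set()],
    [{"IL6+"}, {"TNF+"}, set()],
]
pattern = [{"TNF+"}, {"IL6+"}]  # TNF up, then IL6 up at a later time point
pattern_support = support(database, pattern)
```

A miner would report the pattern as frequent when its support reaches a user-chosen minimum. Enumerating candidates efficiently when each itemset is very large is the part that a dedicated algorithm has to solve; this sketch only checks a single given pattern.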

CONCLUSIONS

We propose an efficient algorithm, CTGR-Span, to mine several biologically meaningful CTGR-SPs. We postulate that biologists can benefit from our new algorithm, since patterns implying gene regulation could provide further insight into the mechanisms of novel gene regulation during a biological or clinical progression. The Java source code, program tutorial and other related materials used in this program are available at http://websystem.csie.ncku.edu.tw/CTGR-Span.rar.
