Information mining over heterogeneous and high-dimensional time-series data in clinical trials databases.

Author information

Altiparmak Fatih, Ferhatosmanoglu Hakan, Erdal Selnur, Trost Donald C

Affiliation

Department of Computer Science and Engineering, The Ohio State University, Columbus, OH 43210, USA.

Publication information

IEEE Trans Inf Technol Biomed. 2006 Apr;10(2):254-63. doi: 10.1109/titb.2005.859885.

Abstract

An effective analysis of clinical trials data involves analyzing different types of data, such as heterogeneous and high-dimensional time-series data. Current time-series analysis methods generally assume that the series at hand are long enough for statistical techniques to be applied to them. Other ideal-case assumptions are that the data are collected at equal-length intervals and that, when time series are compared, their lengths are equal. However, these assumptions are not valid for many real data sets, especially clinical trials data sets. In addition, the data sources differ from each other, the data are heterogeneous, and the sensitivity of the experiments varies by source. Approaches for mining time-series data need to be revisited with this wide range of requirements in mind. In this paper, we propose a novel approach for information mining that involves two major steps: applying a data mining algorithm over homogeneous subsets of data, and identifying common or distinct patterns over the information gathered in the first step. Our approach is implemented specifically for heterogeneous and high-dimensional time-series clinical trials data. Using this framework, we propose a new way of utilizing frequent itemset mining, as well as clustering and declustering techniques with novel distance metrics for measuring similarity between time series. By clustering the data, we find groups of analytes (substances in blood) that are most strongly correlated. Most of these relationships are already known and are verified by the clinical panels; in addition, we identify novel groups that need further biomedical analysis. A slight modification to our algorithm results in an effective declustering of high-dimensional time-series data, which is then used for "feature selection." Using industry-sponsored clinical trials data sets, we are able to identify a small set of analytes that effectively models the state of normal health.
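The two-step idea described in the abstract can be illustrated with a small sketch. This is not the authors' algorithm: the abstract does not specify their distance metrics or clustering procedure, so the sketch substitutes a plain Pearson correlation computed only over the time points two series share (which tolerates unequal lengths and irregular sampling), groups analytes by connected components, and then counts analyte pairs that are co-clustered in several subsets, in the spirit of frequent 2-itemset mining. All function names, thresholds, analyte labels, and values below are hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import sqrt


def pearson_on_shared_times(a, b, min_overlap=3):
    """Pearson correlation over time points present in both series.

    Each series is a dict {time: value}. Returns None when the overlap is
    too short or one series is constant on the shared points.
    """
    shared = sorted(set(a) & set(b))
    if len(shared) < min_overlap:
        return None
    xs = [a[t] for t in shared]
    ys = [b[t] for t in shared]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return None
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sqrt(sxx * syy)


def cluster_subset(series, threshold=0.9):
    """Step 1 (assumed form): group analytes within one homogeneous subset.

    Two analytes are linked when |r| >= threshold; clusters are the
    connected components of that link graph (union-find).
    """
    parent = {name: name for name in series}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path halving
            n = parent[n]
        return n

    for u, v in combinations(sorted(series), 2):
        r = pearson_on_shared_times(series[u], series[v])
        if r is not None and abs(r) >= threshold:
            parent[find(u)] = find(v)

    groups = {}
    for name in series:
        groups.setdefault(find(name), set()).add(name)
    return [frozenset(g) for g in groups.values() if len(g) > 1]


def frequent_pairs(subsets, min_support=2, threshold=0.9):
    """Step 2 (assumed form): keep pairs co-clustered in >= min_support subsets."""
    counts = Counter()
    for series in subsets:
        for group in cluster_subset(series, threshold):
            counts.update(combinations(sorted(group), 2))
    return {pair: n for pair, n in counts.items() if n >= min_support}


# Made-up data: two sources with different sampling times and lengths.
site_a = {
    "ALT": {0: 10, 7: 20, 14: 30, 28: 40},
    "AST": {0: 11, 7: 21, 14: 33, 28: 39},
    "GLU": {0: 5, 7: 1, 14: 6, 28: 2},
}
site_b = {
    "ALT": {0: 1, 3: 2, 10: 3},
    "AST": {0: 2, 3: 4, 10: 6.5},
    "GLU": {0: 9, 3: 2, 10: 7},
}

print(frequent_pairs([site_a, site_b]))  # {('ALT', 'AST'): 2}
```

Clustering each source separately avoids comparing values measured with different sensitivities, and only the resulting group memberships are combined across sources. A declustering variant for feature selection would instead look for analytes that are weakly related to each other; the abstract does not give enough detail to sketch that step faithfully.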
