A novel data mining method to identify assay-specific signatures in functional genomic studies.

Author information

Rollins Derrick K, Zhai Dongmei, Joe Alrica L, Guidarelli Jack W, Murarka Abhishek, Gonzalez Ramon

Affiliation

Department of Chemical and Biological Engineering, Iowa State University, Ames, Iowa 50011, USA.

Publication information

BMC Bioinformatics. 2006 Aug 14;7:377. doi: 10.1186/1471-2105-7-377.

DOI:10.1186/1471-2105-7-377

PMID:16907975

Full text link: https://pmc.ncbi.nlm.nih.gov/articles/PMC1599756/

Abstract

BACKGROUND

The high-dimensional data produced by functional genomic (FG) studies make it difficult to visualize relationships between gene products and experimental conditions (i.e., assays). Although dimensionality reduction methods such as principal component analysis (PCA) have been very useful, their application to identifying assay-specific signatures has been limited by the lack of appropriate methodologies. This article proposes a new PCA-based method for identifying assay-specific gene signatures in FG studies.

RESULTS

The proposed method (PM) is unique for several reasons. First, it is the only one, to our knowledge, that uses gene contribution, the product of the loading and the expression level, to obtain assay signatures. The PM develops and exploits two types of assay-specific contribution plots, which are new to the application of PCA in the FG area. The first type plots the assay-specific gene contribution against the given order of the genes. It reveals variations in distribution between assay-specific gene signatures, as well as outliers within assay groups, which indicate the degree of importance of the most dominant genes. The second type plots the contribution of each gene, sorted in ascending or descending order, against a constantly increasing index. This type of plot reveals assay-specific gene signatures defined by the inflection points in the curve. In addition, sharp regions within the signature define the genes that contribute the most to the signature. We propose and use the curvature as an appropriate metric to characterize these sharp regions, thus identifying the subset of genes contributing the most to the signature. Finally, the PM uses the full dataset to determine the final gene signature, thus eliminating the chance of gene exclusion by poor screening in earlier steps. The strengths of the PM are demonstrated using a simulation study and two studies of real DNA microarray data: a study of classification of human tissue samples and a study of E. coli cultures with different medium formulations.
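The quantities described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give their formulas, so the code assumes that rows are assays and columns are genes, that the contribution of a gene is its PCA loading times its mean-centered expression level, and that curvature is the standard plane-curve formula evaluated by finite differences. The function names `gene_contributions` and `sorted_curve_curvature` and the simulated data are hypothetical.

```python
import numpy as np

def gene_contributions(X, pc=0):
    """Per-assay gene contributions to one principal component.

    X is an (n_assays, n_genes) expression matrix (assumed layout).
    Contribution = loading * centered expression (assumed definition).
    """
    Xc = X - X.mean(axis=0)                       # center each gene
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    loadings = Vt[pc]                             # one loading per gene
    contrib = Xc * loadings                       # shape (n_assays, n_genes)
    scores = U[:, pc] * S[pc]                     # PC score of each assay
    return contrib, scores

def sorted_curve_curvature(values):
    """Sort contributions in descending order and estimate the curvature
    of the resulting curve against the index 0, 1, 2, ...

    Uses kappa = |y''| / (1 + y'^2)^(3/2) with finite differences; this
    value depends on how the two axes are scaled.
    """
    order = np.argsort(values)[::-1]              # gene indices, largest first
    y = values[order]
    dy = np.gradient(y)
    d2y = np.gradient(dy)
    kappa = np.abs(d2y) / (1.0 + dy ** 2) ** 1.5
    return order, y, kappa

# Simulated data (illustrative only): 6 assays, 200 genes, with the first
# 10 genes shifted upward in the first 3 assays.
rng = np.random.default_rng(0)
n_assays, n_genes = 6, 200
X = rng.normal(size=(n_assays, n_genes))
X[:3, :10] += 4.0

contrib, scores = gene_contributions(X, pc=0)

# First plot type: contrib[0] against the given gene order.
# Second plot type: y against its index, with kappa marking sharp regions.
order, y, kappa = sorted_curve_curvature(contrib[0])
sharpest = order[np.argsort(kappa)[::-1][:10]]    # genes at the sharpest bends
```

Under these assumptions, the contributions of all genes in an assay sum to that assay's score on the chosen component, which gives a quick consistency check. The sign of a principal component is arbitrary, so whether the dominant genes appear at the top or the bottom of the sorted curve can differ between runs or libraries.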

CONCLUSION

We have developed a PCA-based method that effectively identifies assay-specific signatures in ranked groups of genes from the full data set, using a simpler and more efficient procedure than current approaches. Although this work demonstrates the ability of the PM to identify assay-specific signatures in DNA microarray experiments, the approach could also be useful in areas such as proteomics and metabolomics.
