Data-driven analysis approach for biomarker discovery using molecular-profiling technologies.

Author information

Wei T, Liao B, Ackermann B L, Jolly R A, Eckstein J A, Kulkarni N H, Helvering L M, Goldstein K M, Shou J, Estrem S T, Ryan T P, Colet J-M, Thomas C E, Stevens J L, Onyia J E

Affiliation

Integrative Biology, Lilly Research Laboratories, Greenfield, IN 46140, USA.

Publication information

Biomarkers. 2005 Mar-Jun;10(2-3):153-72. doi: 10.1080/13547500500107430.

DOI:10.1080/13547500500107430

PMID:16076730

Abstract

High-throughput molecular-profiling technologies provide rapid, efficient and systematic approaches to the search for biomarkers. Supervised learning algorithms are naturally suited to analysing the large amounts of data that these technologies generate in biomarker discovery efforts. Using two examples, this study demonstrates a data-driven approach to analysing large, complex datasets collected with high-throughput technologies in the context of biomarker discovery. The approach consists of two analytic steps: an initial unsupervised analysis to obtain accurate knowledge about sample clustering, followed by a supervised analysis to identify a small set of putative biomarkers for further experimental characterization. A comparison of the most widely applied clustering algorithms on a leukaemia DNA microarray dataset established that principal component analysis (PCA)-assisted projections of samples from a high-dimensional molecular feature space into a few low-dimensional subspaces provide a more effective and accurate way to visually explore and identify data structures that confirm intended experimental effects based on expected group membership. A supervised analysis method, the shrunken centroid algorithm, was then chosen to use the knowledge of sample clustering gained or confirmed in the first step to identify a small set of molecules as candidate biomarkers for further experimentation. The approach was applied to two molecular-profiling studies. In the first study, PCA-assisted analysis of DNA microarray data revealed discrete data structures in rat liver gene expression that correlated with blood clinical chemistry and liver pathological damage in response to the chemical toxicant diethylhexylphthalate, a peroxisome proliferator-activated receptor agonist. The shrunken centroid algorithm then identified 16 genes as the best candidate biomarkers for liver damage. Functional annotation of these genes revealed roles in the acute phase response and in lipid and fatty acid metabolism, both functionally relevant to the observed toxicities. In the second study, 26 urine ions identified from a GC/MS spectrum, two of which were glucose fragment ions included as positive controls, showed robust changes with the development of diabetes in Zucker diabetic fatty rats. Further experiments are needed to define their chemical identities and establish their functional relevance to disease development.
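
The two-step approach described in the abstract can be illustrated with a minimal pure-Python sketch. This is not the authors' code. The toy data, the function names `first_principal_component` and `shrunken_centroid_features`, the threshold `delta`, and the exact form of the standardisation (pooled within-class standard deviation plus its median as an offset) are all assumptions made for illustration. Step 1 projects samples onto the first principal component to check that they separate by expected group; step 2 keeps only the features whose standardised class-centroid shift survives a threshold, which is the feature-selection core of a shrunken centroid method.

```python
import math

def first_principal_component(rows, iters=200):
    """Return (loading vector, sample scores) for PC1 via power iteration."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centred = [[r[j] - means[j] for j in range(p)] for r in rows]
    v = [1.0 / math.sqrt(p)] * p  # arbitrary unit-length starting vector
    for _ in range(iters):
        s = [sum(c[j] * v[j] for j in range(p)) for c in centred]           # X v
        w = [sum(centred[i][j] * s[i] for i in range(n)) for j in range(p)]  # X^T X v
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    scores = [sum(c[j] * v[j] for j in range(p)) for c in centred]
    return v, scores

def shrunken_centroid_features(rows, labels, delta):
    """Indices of features whose standardised centroid shift exceeds delta."""
    n, p = len(rows), len(rows[0])
    classes = sorted(set(labels))
    overall = [sum(r[j] for r in rows) / n for j in range(p)]
    groups = {k: [r for r, y in zip(rows, labels) if y == k] for k in classes}
    cents = {k: [sum(r[j] for r in g) / len(g) for j in range(p)]
             for k, g in groups.items()}
    # Pooled within-class standard deviation for each feature.
    sd = [math.sqrt(sum((r[j] - cents[y][j]) ** 2 for r, y in zip(rows, labels))
                    / (n - len(classes))) for j in range(p)]
    s0 = sorted(sd)[p // 2]  # offset guarding against tiny variances (assumed choice)
    selected = set()
    for k in classes:
        m_k = math.sqrt(1.0 / len(groups[k]) + 1.0 / n)
        for j in range(p):
            d = (cents[k][j] - overall[j]) / (m_k * (sd[j] + s0))
            if abs(d) > delta:  # soft-thresholding would shrink this shift to zero otherwise
                selected.add(j)
    return sorted(selected)

# Hypothetical toy data: 10 samples x 6 features; only features 0 and 1
# differ between groups, the rest carry small deterministic "noise".
labels = ["control"] * 5 + ["treated"] * 5
shift = [5.0, -4.0, 0.0, 0.0, 0.0, 0.0]
rows = [[(shift[j] if labels[i] == "treated" else 0.0)
         + ((2 * i + 3 * j) % 5 - 2) * 0.1
         for j in range(6)] for i in range(10)]

# Step 1 (unsupervised): do PC1 scores separate the expected groups?
pc1, scores = first_principal_component(rows)
print("PC1 scores:", [round(s, 2) for s in scores])

# Step 2 (supervised): which features drive the group difference?
selected = shrunken_centroid_features(rows, labels, delta=2.0)
print("Selected features:", selected)
```

In practice the threshold would normally be chosen by cross-validation rather than fixed by hand, and the unsupervised step would inspect several principal components rather than only the first.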
