

Functional genomics and proteomics in the clinical neurosciences: data mining and bioinformatics.

Author information

Phan John H, Quo Chang-Feng, Wang May D

Affiliations

The Wallace H. Coulter Department of Biomedical Engineering, Georgia Institute of Technology and Emory University, Atlanta, GA 30322, USA.

Publication information

Prog Brain Res. 2006;158:83-108. doi: 10.1016/S0079-6123(06)58004-5.

Abstract

The goal of this chapter is to introduce some of the computational methods available for expression analysis. Genomic and proteomic experimental techniques are briefly discussed to help the reader understand these methods and their results in the context of their biological significance. A case study then illustrates the use of these analytical methods to extract significant biomarkers from high-throughput microarray data. Genomic and proteomic data analysis is essential for understanding the underlying factors involved in human disease. Currently, such experimental data are generally obtained by high-throughput microarray or mass spectrometry technologies, among others. The sheer amount of raw data obtained using these methods warrants specialized computational methods for data analysis. Biomarker discovery for neurological diagnosis and prognosis is one such example. By extracting significant genomic and proteomic biomarkers in controlled experiments, we come closer to understanding how biological mechanisms contribute to neurodegenerative diseases such as Alzheimer's and how drug treatments interact with the nervous system. In the biomarker discovery process, several computational methods must be carefully considered to analyze genomic or proteomic data accurately. These methods include quality control, clustering, classification, feature ranking, and validation. Data quality control and normalization methods reduce technical variability and help ensure that discovered biomarkers are statistically significant. Preprocessing steps must be carefully selected since they may adversely affect the results of the subsequent expression analysis steps, which generally fall into two categories: unsupervised and supervised. Unsupervised, or clustering, methods can be used to group similar genomic or proteomic profiles and can therefore elucidate relationships within sample groups. These methods can also assign biomarkers to subgroups based on their expression profiles across patient samples. Although clustering is useful for exploratory analysis, it is limited by its inability to incorporate expert knowledge. On the other hand, classification and feature ranking are supervised, knowledge-based machine learning methods that estimate the distribution of biological expression data and, in doing so, can extract important information about these experiments. Classification is closely coupled with feature ranking, which is essentially a data reduction method that uses classification error estimation or other statistical tests to score features. Biomarkers can subsequently be extracted by eliminating insignificantly ranked features. These analytical methods may be applied equally to genetic and proteomic data. However, because of both biological differences between the data sources and technical differences between the experimental methods used to obtain these data, it is important to have a firm understanding of the data sources and experimental methods. At the same time, regardless of the data quality, it is inevitable that some discovered biomarkers are false positives. Thus, it is important to validate discovered biomarkers. The validation process may be slow; yet the overall biomarker discovery process is significantly accelerated by the initial feature ranking and data reduction steps. Information obtained from the validation process may also be used to refine data analysis procedures for future iterations. Biomarker validation may be performed in a number of ways: bench-side in traditional labs, through web-based electronic resources such as gene ontology and literature databases, and in clinical trials.
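
The coupling between feature ranking and classification described in the abstract can be made concrete with a small sketch. This is not the chapter's case study: the data are synthetic, and the specific choices (Welch's t-statistic as the ranking test, a nearest-centroid classifier, leave-one-out cross-validation, and every function name and parameter below) are assumptions made for illustration only. Features are re-ranked inside each fold so that the held-out sample never influences feature selection.

```python
import random
import statistics


def welch_t(a, b):
    """Welch's t-statistic for two lists of expression values."""
    var_a, var_b = statistics.variance(a), statistics.variance(b)
    se = (var_a / len(a) + var_b / len(b)) ** 0.5
    # A tiny constant guards against division by zero for constant features.
    return (statistics.mean(a) - statistics.mean(b)) / (se + 1e-12)


def rank_features(X, y):
    """Return feature indices sorted by |t|, most discriminative first."""
    scores = []
    for j in range(len(X[0])):
        group0 = [row[j] for row, label in zip(X, y) if label == 0]
        group1 = [row[j] for row, label in zip(X, y) if label == 1]
        scores.append(abs(welch_t(group0, group1)))
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)


def nearest_centroid(train_X, train_y, x, features):
    """Predict the label whose class centroid is closest to sample x."""
    best_label, best_dist = None, float("inf")
    for label in sorted(set(train_y)):
        rows = [r for r, l in zip(train_X, train_y) if l == label]
        centroid = [statistics.mean(r[j] for r in rows) for j in features]
        dist = sum((x[j] - c) ** 2 for j, c in zip(features, centroid))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


def loocv_error(X, y, k):
    """Leave-one-out error rate using the top-k features of each fold."""
    errors = 0
    for i in range(len(X)):
        train_X = X[:i] + X[i + 1:]
        train_y = y[:i] + y[i + 1:]
        top_k = rank_features(train_X, train_y)[:k]
        if nearest_centroid(train_X, train_y, X[i], top_k) != y[i]:
            errors += 1
    return errors / len(X)


def make_synthetic(n_per_class=10, n_features=50, n_informative=3,
                   shift=5.0, seed=0):
    """Hypothetical data: Gaussian noise, with the first few features
    shifted upward in class 1 to mimic differential expression."""
    rng = random.Random(seed)
    X, y = [], []
    for label in (0, 1):
        for _ in range(n_per_class):
            row = [rng.gauss(0.0, 1.0) for _ in range(n_features)]
            if label == 1:
                for j in range(n_informative):
                    row[j] += shift
            X.append(row)
            y.append(label)
    return X, y


X, y = make_synthetic()
top3 = rank_features(X, y)[:3]
error = loocv_error(X, y, k=3)
print("top-ranked features:", sorted(top3))
print("LOOCV error:", error)
```

On real microarray data the effect sizes are far smaller than in this toy setup, so some top-ranked features will be false positives, which is why the abstract stresses independent validation of discovered biomarkers.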

