Prieto Carlos, Risueño Alberto, Fontanillo Celia, De las Rivas Javier
Bioinformatics and Functional Genomics Research Group, Cancer Research Center (CIC-IBMCC, CSIC/USAL), Salamanca, Spain.
PLoS One. 2008;3(12):e3911. doi: 10.1371/journal.pone.0003911. Epub 2008 Dec 15.
BACKGROUND: Analysis of gene expression data using genome-wide microarrays is a technique often used in genomic studies to find coexpression patterns and locate groups of co-transcribed genes. However, most studies done at a global "omic" scale do not focus on human samples, and those that do very often include heterogeneous datasets that mix normal with disease-altered samples. Moreover, the technical noise present in genome-wide expression microarrays is another well-reported problem that is often not addressed with robust statistical methods, and error estimates for the data are not provided.
METHODOLOGY/PRINCIPAL FINDINGS: Human genome-wide expression data from a controlled set of normal, healthy tissues are used to build a confident human gene coexpression network that avoids both pathological and technical noise. To achieve this, we describe a new method that combines several statistical and computational strategies: robust normalization and expression signal calculation; correlation coefficients obtained by parametric and non-parametric methods; random cross-validations; and estimation of the statistical accuracy and coverage of the data. Together, these methods provide a series of coexpression datasets in which the level of error is measured and can be tuned. To define the errors, the rates of true positives are calculated by assignment to biological pathways. The results provide a confident human gene coexpression network that includes 3327 gene-nodes and 15841 coexpression-links, and a comparative analysis shows an improvement over previously published datasets. Further functional analysis of a core subset of the network, validated by two independent methods, shows coherent biological modules that share common transcription factors. The network reveals a map of coexpression clusters organized in well-defined functional constellations. Two major regions in this network correspond to genes involved in nuclear and mitochondrial metabolism, and investigation of their functional assignment indicates that more than 60% are house-keeping and essential genes. The network displays previously undescribed gene associations, and it allows some unknown, unassigned genes to be placed in a functional context based on their interactions with known gene families.
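The combination of parametric and non-parametric correlation with a pathway-based error estimate can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes Pearson and Spearman as the two coefficients, a single shared cutoff (`threshold`), and a definition of "true positive" as a linked gene pair sharing at least one pathway annotation. The gene names, expression values, pathway labels, and all function names are hypothetical, and the normalization and random cross-validation steps are left out.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Parametric (Pearson) correlation between two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant profile carries no coexpression signal
    return sxy / sqrt(sxx * syy)


def ranks(values):
    """1-based ranks; tied values receive the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Non-parametric (Spearman) correlation: Pearson on the ranks."""
    return pearson(ranks(x), ranks(y))


def coexpression_links(expr, threshold):
    """Keep gene pairs whose Pearson AND Spearman values reach the cutoff.

    Requiring both is an assumption made for this sketch.
    """
    links = set()
    for g1, g2 in combinations(sorted(expr), 2):
        if (pearson(expr[g1], expr[g2]) >= threshold
                and spearman(expr[g1], expr[g2]) >= threshold):
            links.add((g1, g2))
    return links


def pathway_precision(links, pathways):
    """Fraction of annotated links whose two genes share a pathway."""
    tp = fp = 0
    for g1, g2 in links:
        if g1 not in pathways or g2 not in pathways:
            continue  # pairs with an unannotated gene cannot be scored
        if pathways[g1] & pathways[g2]:
            tp += 1
        else:
            fp += 1
    return tp / (tp + fp) if tp + fp else None


# Hypothetical toy data: expression of four genes across five samples.
expr = {
    "geneA": [1.0, 2.0, 3.0, 4.0, 5.0],
    "geneB": [1.1, 2.3, 2.9, 4.2, 5.1],
    "geneC": [5.0, 3.0, 4.0, 1.0, 2.0],
    "geneD": [2.0, 4.1, 6.2, 7.9, 10.0],
}
pathways = {"geneA": {"p1"}, "geneB": {"p1"}, "geneC": {"p2"}, "geneD": {"p2"}}

links = coexpression_links(expr, threshold=0.9)
precision = pathway_precision(links, pathways)
```

Raising `threshold` generally yields fewer links with a higher estimated precision, and lowering it trades precision for coverage. This is one way to read the statement that the level of error can be tuned.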
CONCLUSIONS/SIGNIFICANCE: The identification of stable and reliable human gene-to-gene coexpression networks is essential to unravel the interactions and functional correlations between human genes at an omic scale. This work contributes to that aim, and we are making the validated human gene coexpression networks available to the scientific community to allow further analyses of the network or of specific gene associations. The data are freely available online at http://bioinfow.dep.usal.es/coexpression/.