Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, USA.
Center for Computational Biology, Johns Hopkins University, USA.
Nucleic Acids Res. 2018 May 18;46(9):e54. doi: 10.1093/nar/gky102.
Publicly available genomic data are a valuable resource for studying normal human variation and disease, but these data are often not well labeled or annotated. The lack of phenotype information for public genomic data severely limits their utility for addressing targeted biological questions. We develop an in silico phenotyping approach for predicting critical missing annotation directly from genomic measurements using well-annotated genomic and phenotypic data produced by consortia like TCGA and GTEx as training data. We apply in silico phenotyping to a set of 70,000 RNA-seq samples we recently processed on a common pipeline as part of the recount2 project. We use gene expression data to build and evaluate predictors for both biological phenotypes (sex, tissue, sample source) and experimental conditions (sequencing strategy). We demonstrate how these predictions can be used to study cross-sample properties of public genomic data, select genomic projects with specific characteristics, and perform downstream analyses using predicted phenotypes. The methods to perform phenotype prediction are available in the phenopredict R package and the predictions for recount2 are available from the recount R package. With data and phenotype information available for 70,000 human samples, expression data are available for use on a scale that was not previously feasible.
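The general idea of in silico phenotyping, training on well-annotated samples and then predicting a missing label for unannotated ones, can be illustrated with a minimal sketch. This is not the phenopredict implementation: the nearest-centroid classifier, the two marker-gene columns, the sample names, and all expression values below are assumptions made up for illustration only.

```python
from statistics import mean


def fit_centroids(expression, labels):
    """Average the expression profiles of the training samples within each label."""
    grouped = {}
    for profile, label in zip(expression, labels):
        grouped.setdefault(label, []).append(profile)
    return {
        label: [mean(values) for values in zip(*profiles)]
        for label, profiles in grouped.items()
    }


def predict(centroids, profile):
    """Return the label whose centroid is closest in squared Euclidean distance."""
    def distance(centroid):
        return sum((x - c) ** 2 for x, c in zip(profile, centroid))

    return min(centroids, key=lambda label: distance(centroids[label]))


# Hypothetical training data standing in for a well-annotated consortium set.
# Columns are two invented marker genes (one X-linked, one Y-linked);
# the values are fabricated toy log-expression levels, not real measurements.
train_expression = [[8.1, 0.2], [7.6, 0.4], [0.3, 6.9], [0.5, 7.4]]
train_labels = ["female", "female", "male", "male"]

centroids = fit_centroids(train_expression, train_labels)

# Hypothetical unannotated samples whose sex label is missing.
unlabeled = {"sample_A": [7.9, 0.1], "sample_B": [0.2, 7.0]}
predicted = {name: predict(centroids, profile) for name, profile in unlabeled.items()}
print(predicted)
```

A real predictor would select informative expression regions from the training data and would be evaluated on held-out samples before being applied to unannotated data; the sketch omits both steps for brevity.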