Improving the value of public RNA-seq expression data by phenotype prediction.

Affiliations

Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, USA.

Center for Computational Biology, Johns Hopkins University, USA.

Publication information

Nucleic Acids Res. 2018 May 18;46(9):e54. doi: 10.1093/nar/gky102.

DOI:10.1093/nar/gky102

PMID:29514223

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC5961118/

Abstract

Publicly available genomic data are a valuable resource for studying normal human variation and disease, but these data are often not well labeled or annotated. The lack of phenotype information for public genomic data severely limits their utility for addressing targeted biological questions. We develop an in silico phenotyping approach for predicting critical missing annotation directly from genomic measurements using well-annotated genomic and phenotypic data produced by consortia like TCGA and GTEx as training data. We apply in silico phenotyping to a set of 70,000 RNA-seq samples we recently processed on a common pipeline as part of the recount2 project. We use gene expression data to build and evaluate predictors for both biological phenotypes (sex, tissue, sample source) and experimental conditions (sequencing strategy). We demonstrate how these predictions can be used to study cross-sample properties of public genomic data, select genomic projects with specific characteristics, and perform downstream analyses using predicted phenotypes. The methods to perform phenotype prediction are available in the phenopredict R package and the predictions for recount2 are available from the recount R package. With data and phenotype information available for 70,000 human samples, expression data are available for use on a scale that was not previously feasible.
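
The general idea in the abstract (train a predictor on well-annotated samples, then apply it to samples whose annotation is missing) can be illustrated with a small sketch. Everything below is an assumption made for illustration: the data are simulated, the four "genes" and the helper names (`make_samples`, `select_features`, `fit_centroids`, `predict`) are hypothetical, and a plain nearest-centroid classifier stands in for whatever model the authors use. It is not the phenopredict implementation.

```python
import random
from statistics import mean


def make_samples(n, label, rng):
    """Simulate log-expression for 4 hypothetical genes (illustrative data only)."""
    shift = 3.0 if label == "female" else 0.0
    samples = []
    for _ in range(n):
        samples.append([
            rng.gauss(2.0 + shift, 0.5),  # hypothetical female-biased gene
            rng.gauss(5.0 - shift, 0.5),  # hypothetical male-biased gene
            rng.gauss(4.0, 0.5),          # uninformative gene
            rng.gauss(1.0, 0.5),          # uninformative gene
        ])
    return samples


def select_features(X, y, k):
    """Keep the k genes with the largest absolute difference in class means."""
    classes = sorted(set(y))
    assert len(classes) == 2, "this sketch handles two classes only"
    scores = []
    for g in range(len(X[0])):
        m0 = mean(x[g] for x, lab in zip(X, y) if lab == classes[0])
        m1 = mean(x[g] for x, lab in zip(X, y) if lab == classes[1])
        scores.append((abs(m0 - m1), g))
    top = sorted(scores, reverse=True)[:k]
    return sorted(g for _, g in top)


def fit_centroids(X, y, features):
    """Compute the mean expression profile of each class over the selected genes."""
    centroids = {}
    for lab in set(y):
        rows = [x for x, l in zip(X, y) if l == lab]
        centroids[lab] = [mean(r[g] for r in rows) for g in features]
    return centroids


def predict(x, centroids, features):
    """Assign the label whose centroid is closest in squared Euclidean distance."""
    def dist(c):
        return sum((x[g] - cg) ** 2 for g, cg in zip(features, c))
    return min(centroids, key=lambda lab: dist(centroids[lab]))


rng = random.Random(0)

# "Training" data: stands in for well-annotated consortium samples.
train_X = make_samples(50, "female", rng) + make_samples(50, "male", rng)
train_y = ["female"] * 50 + ["male"] * 50
features = select_features(train_X, train_y, k=2)
centroids = fit_centroids(train_X, train_y, features)

# "Unlabeled" data: stands in for public samples with missing annotation.
# True labels are kept here only to measure accuracy in the simulation.
test_X = make_samples(20, "female", rng) + make_samples(20, "male", rng)
test_y = ["female"] * 20 + ["male"] * 20
predicted = [predict(x, centroids, features) for x in test_X]
accuracy = sum(p == t for p, t in zip(predicted, test_y)) / len(test_y)
print("selected genes:", features, "accuracy:", accuracy)
```

The feature-selection step is kept separate from the classifier so that either can be swapped out; with real expression data one would also need normalization and a held-out evaluation set drawn from independent studies.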

Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0fa7/5961118/7a25697cfec6/gky102fig1.jpg
