Department of Mathematics and Statistics, University of Turku, Turku, Finland.
Department of Pharmacology, University of Colorado Anschutz Medical Campus, Aurora, CO, USA.
Sci Data. 2023 Jul 5;10(1):430. doi: 10.1038/s41597-023-02335-4.
Genomic and transcriptomic data have been generated across a wide range of prostate cancer (PCa) study cohorts. These data can be used to better characterize the molecular features associated with clinical outcomes and to test hypotheses across multiple, independent patient cohorts. In addition, derived features, such as estimates of cell composition, risk scores, and androgen receptor (AR) scores, can be used to develop novel hypotheses leveraging existing multi-omic datasets. The full potential of such data is yet to be realized as independent datasets exist in different repositories, have been processed using different pipelines, and derived and clinical features are often not provided or not standardized. Here, we present the curatedPCaData R package, a harmonized data resource representing >2900 primary tumor, >200 normal tissue, and >500 metastatic PCa samples across 19 datasets processed using standardized pipelines with updated gene annotations. We show that meta-analysis across harmonized studies has great potential for robust and clinically meaningful insights. curatedPCaData is an open and accessible community resource with code made available for reproducibility.