
Harmonizing and integrating the NCI Genomic Data Commons through accessible, interactive, and cloud-enabled workflows.

Authors

Hung Ling-Hong, Fukuda Bryce, Schmitz Robert, Hoang Varik, Lloyd Wes, Yeung Ka Yee

Affiliations

School of Engineering and Technology, University of Washington Tacoma, Tacoma, Washington, USA.

Biodepot LLC, Seattle, Washington, USA.

Publication information

PLoS One. 2025 Mar 4;20(3):e0318676. doi: 10.1371/journal.pone.0318676. eCollection 2025.

DOI:10.1371/journal.pone.0318676
PMID:40036210
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11878898/
Abstract

Cancer data is widely available in repositories such as the National Cancer Institute (NCI) Genomic Data Commons (GDC). These datasets could serve as controls or comparisons in compendium analyses with user data, avoiding the expense and time of generating additional datasets. However, the user must be able to process their new data in the same manner for these comparisons to be useful. This can be non-trivial. Although the executables themselves are usually available in repositories, the GDC pipelines that describe that entire analysis workflow are currently published as text-based standard operating procedures (SOPs). It is difficult to document a computational workflow to the level of detail and accuracy required to reproduce the results. Discrepancies between versions and exclusions of details accumulate as the documentation inevitably lags behind code revisions. Our goal is to enhance the utility of the GDC by converting the SOPs into an accessible and executable format. Specifically, we converted the GDC DNA sequencing (DNA-Seq) and the GDC mRNA sequencing (mRNA-Seq) SOPs into reproducible, self-installing, containerized, and interactive graphical workflows. These can be applied to reproducibly process user data and to harmonize datasets across repositories. Using our publicly available graphical workflows, we harmonize raw RNA-Seq datasets from the GDC and the Genotype-Tissue Expression (GTEx) project that were originally processed using different methodologies to illustrate the importance of uniform processing of control and treatment data for accurate inference of differentially expressed genes. By disseminating the analytical methodology in a reproducible and executable form, we greatly increase the utility of the GDC by enabling researchers to uniformly process custom data and datasets across multiple repositories to enhance data interpretation. 
Our approach and open-source executable workflows of making the analytical process as readily available as the data can be applied to other data repositories to increase their impact on scientific research.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0eda/11878898/48863f8fc73b/pone.0318676.g001.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0eda/11878898/e600b4081a51/pone.0318676.g002.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0eda/11878898/8b2d5b645a67/pone.0318676.g003.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0eda/11878898/738cdd524a6f/pone.0318676.g004.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0eda/11878898/511c1e5b4946/pone.0318676.g005.jpg

Similar articles

1
Harmonizing and integrating the NCI Genomic Data Commons through accessible, interactive, and cloud-enabled workflows.
PLoS One. 2025 Mar 4;20(3):e0318676. doi: 10.1371/journal.pone.0318676. eCollection 2025.
2
Uniform genomic data analysis in the NCI Genomic Data Commons.
Nat Commun. 2021 Feb 22;12(1):1226. doi: 10.1038/s41467-021-21254-9.
3
AnVILWorkflow: A runnable workflow package for Cloud-implemented bioinformatics analysis pipelines.
F1000Res. 2024 Oct 21;13:1257. doi: 10.12688/f1000research.155449.1. eCollection 2024.
4
VDJServer: A Cloud-Based Analysis Portal and Data Commons for Immune Repertoire Sequences and Rearrangements.
Front Immunol. 2018 May 8;9:976. doi: 10.3389/fimmu.2018.00976. eCollection 2018.
5
NCI Cancer Research Data Commons: Cloud-Based Analytic Resources.
Cancer Res. 2024 May 2;84(9):1396-1403. doi: 10.1158/0008-5472.CAN-23-2657.
6
DolphinNext: a distributed data processing platform for high throughput genomics.
BMC Genomics. 2020 Apr 19;21(1):310. doi: 10.1186/s12864-020-6714-x.
7
Developing Cancer Informatics Applications and Tools Using the NCI Genomic Data Commons API.
Cancer Res. 2017 Nov 1;77(21):e15-e18. doi: 10.1158/0008-5472.CAN-17-0598.
8
A (fire)cloud-based DNA methylation data preprocessing and quality control platform.
BMC Bioinformatics. 2019 Mar 29;20(1):160. doi: 10.1186/s12859-019-2750-4.
9
Using the Seven Bridges Cancer Genomics Cloud to Access and Analyze Petabytes of Cancer Data.
Curr Protoc Bioinformatics. 2017 Dec 8;60:11.16.1-11.16.32. doi: 10.1002/cpbi.39.
10
Cloud-enabled Biodepot workflow builder integrates image processing using Fiji with reproducible data analysis using Jupyter notebooks.
Sci Rep. 2022 Sep 2;12(1):14920. doi: 10.1038/s41598-022-19173-w.
