Hung Ling-Hong, Fukuda Bryce, Schmitz Robert, Hoang Varik, Lloyd Wes, Yeung Ka Yee
School of Engineering and Technology, University of Washington Tacoma, Tacoma, Washington, USA.
Biodepot LLC, Seattle, Washington, USA.
PLoS One. 2025 Mar 4;20(3):e0318676. doi: 10.1371/journal.pone.0318676. eCollection 2025.
Cancer data is widely available in repositories such as the National Cancer Institute (NCI) Genomic Data Commons (GDC). These datasets could serve as controls or comparisons in compendium analyses with user data, avoiding the expense and time of generating additional datasets. However, for these comparisons to be useful, users must be able to process their new data in the same manner as the repository data. This can be non-trivial. Although the executables themselves are usually available in repositories, the GDC pipelines that describe the entire analysis workflow are currently published as text-based standard operating procedures (SOPs). It is difficult to document a computational workflow to the level of detail and accuracy required to reproduce the results. Discrepancies between versions and omissions of detail accumulate as the documentation inevitably lags behind code revisions. Our goal is to enhance the utility of the GDC by converting the SOPs into an accessible and executable format. Specifically, we converted the GDC DNA sequencing (DNA-Seq) and the GDC mRNA sequencing (mRNA-Seq) SOPs into reproducible, self-installing, containerized, and interactive graphical workflows. These can be applied to reproducibly process user data and to harmonize datasets across repositories. Using our publicly available graphical workflows, we harmonize raw RNA-Seq datasets from the GDC and the Genotype-Tissue Expression (GTEx) project that were originally processed using different methodologies. This illustrates the importance of uniform processing of control and treatment data for accurate inference of differentially expressed genes. By disseminating the analytical methodology in a reproducible and executable form, we greatly increase the utility of the GDC, enabling researchers to uniformly process custom data and datasets across multiple repositories to enhance data interpretation.
Our approach of making the analytical process as readily available as the data, together with our open-source executable workflows, can be applied to other data repositories to increase their impact on scientific research.