Rueda Manuel, Gut Ivo G
Centro Nacional de Análisis Genómico, C/Baldiri Reixac 4, 08028 Barcelona, Spain.
Universitat de Barcelona (UB), Barcelona, Spain.
medRxiv. 2025 Sep 7:2025.09.05.25335150. doi: 10.1101/2025.09.05.25335150.
In biomedical research, subjects and biospecimens are commonly tracked using simple IDs or UUIDs, which guarantee uniqueness but convey no embedded semantic information. Contextual metadata (such as tissue type, diagnosis, or assay) is often stored separately, making integration, cohort selection, and downstream analysis cumbersome. While structured barcoding systems exist in large consortia (e.g., TCGA, GTEx) or domain-specific contexts (e.g., SPREC, GOLD), no unified, extensible framework currently spans both subjects and biosamples in a human- and machine-readable way.
We developed ClarID, a domain-agnostic specification that supports two identifier formats: (i) a human-readable form (e.g., 'CNAG_Test-HomSap-00001-LIV-TUM-RNA-C22.0-TRT-P1W') that encodes key metadata such as project, species, subject_id, tissue, assay, disease, timepoint, and duration (measured from that event); and (ii) a compact version named 'stub' (e.g., 'CT01001LTR0N401T1W') optimized for filenames, pipelines, and labeling. ClarID is implemented through an open-source command-line tool, ClarID-Tools, which processes tabular metadata files (CSV/TSV) and uses a YAML-based codebook to generate, decode, and validate identifiers, as well as to create and read QR codes. The tool supports bulk and single-sample processing and can be integrated into institutional workflows.
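To illustrate how the human-readable form carries its metadata, the following minimal Python sketch splits the example identifier above into named fields. It is not the ClarID-Tools implementation and does not use the YAML codebook. The function name `split_human_readable` and the field names are assumptions made for illustration; in particular, the label `sample_type` for the fifth field ('TUM') is a guess, since that field is not named in the text. The sketch also assumes that field values never contain a hyphen.

```python
from typing import Dict

# Hypothetical field names, in the order they appear in the example
# identifier. "sample_type" is an assumed label for the fifth field;
# the remaining names follow the metadata listed in the text.
FIELDS = (
    "project",
    "species",
    "subject_id",
    "tissue",
    "sample_type",
    "assay",
    "disease",
    "timepoint",
    "duration",
)


def split_human_readable(identifier: str) -> Dict[str, str]:
    """Split a hyphen-delimited identifier into named fields.

    Assumes no field value contains a hyphen. A real decoder would
    also check each value against the codebook, which this sketch
    does not attempt.
    """
    parts = identifier.split("-")
    if len(parts) != len(FIELDS):
        raise ValueError(
            f"expected {len(FIELDS)} fields, got {len(parts)}: {identifier!r}"
        )
    return dict(zip(FIELDS, parts))


example = "CNAG_Test-HomSap-00001-LIV-TUM-RNA-C22.0-TRT-P1W"
decoded = split_human_readable(example)
for name, value in decoded.items():
    print(f"{name}: {value}")
```

Splitting on the delimiter is enough to recover the fields of the human-readable form; the compact stub form cannot be decoded this way, because interpreting it requires the codebook that maps each short code back to its value.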
To demonstrate ClarID's utility, we applied it to datasets from the Genomic Data Commons (GDC), generating interpretable identifiers for more than 113,000 clinical records (subjects) and 4,255 biospecimen records. All materials, including pre-processing scripts and the input and encoded data, are publicly available, and the analysis can be reproduced via the accompanying GitHub repository and Google Colab.
ClarID fills a critical gap between opaque accession numbers and rich metadata schemas by embedding key context directly into structured identifiers. It enhances traceability, facilitates downstream analysis, and remains adaptable to project-specific needs through a configurable codebook. The accompanying ClarID-Tools software is freely available, together with full documentation and reproducible pipelines, at https://github.com/CNAG-Biomedical-Informatics/clarid-tools.