ClarID: A Human-Readable and Compact Identifier Specification for Biomedical Metadata Integration.

Authors

Rueda Manuel, Gut Ivo G

Affiliations

Centro Nacional de Análisis Genómico, C/Baldiri Reixac 4, 08028 Barcelona, Spain.

Universitat de Barcelona (UB), Barcelona, Spain.

Publication

medRxiv. 2025 Sep 7:2025.09.05.25335150. doi: 10.1101/2025.09.05.25335150.

DOI:10.1101/2025.09.05.25335150
PMID:40950469
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12424912/
Abstract

BACKGROUND

In biomedical research, subjects and biospecimens are commonly tracked using simple IDs or UUIDs, which guarantee uniqueness but convey no embedded semantic information. Contextual metadata (such as tissue type, diagnosis, or assay) is often stored separately, making integration, cohort selection, and downstream analysis cumbersome. While structured barcoding systems exist in large consortia (e.g., TCGA, GTEx) or domain-specific contexts (e.g., SPREC, GOLD), no unified, extensible framework currently spans both subjects and biosamples in a human- and machine-readable way.

METHODS

We developed ClarID, a domain-agnostic specification that supports two identifier formats: (i) a human-readable form (e.g., 'CNAG_Test-HomSap-00001-LIV-TUM-RNA-C22.0-TRT-P1W') that encodes key metadata such as project, species, subject_id, tissue, assay, disease, timepoint and duration (from that event); and (ii) a compact version named 'stub' (e.g., 'CT01001LTR0N401T1W') optimized for filenames, pipelines, and labeling.

ClarID is implemented through an open-source command-line tool, ClarID-Tools, which processes tabular metadata files (CSV/TSV) and uses a YAML-based codebook to generate, decode, and validate identifiers, as well as to create and read QR codes. The tool supports bulk and single-sample processing and allows easy integration with institutional workflows.
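As an illustration of how a hyphen-delimited identifier of this kind can be decoded and checked, here is a minimal Python sketch. It is not the ClarID-Tools implementation. The field order is inferred from the example above, the label `sample_type` for the fifth segment ('TUM') is an assumption because the abstract does not name it, and the small `CODEBOOK` dictionary is a hypothetical stand-in for the YAML codebook.

```python
# Assumed field order for the human-readable form. The abstract names eight
# kinds of metadata but the example has nine segments, so "sample_type"
# (the "TUM" segment) is a hypothetical label, not taken from the specification.
FIELDS = (
    "project", "species", "subject_id", "tissue", "sample_type",
    "assay", "disease", "timepoint", "duration",
)

# Toy stand-in for the YAML codebook: allowed codes per field (hypothetical).
CODEBOOK = {
    "species": {"HomSap"},
    "tissue": {"LIV"},
    "assay": {"RNA"},
}


def decode(identifier: str) -> dict[str, str]:
    """Split a hyphen-delimited identifier into named fields."""
    parts = identifier.split("-")
    if len(parts) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} segments, got {len(parts)}")
    return dict(zip(FIELDS, parts))


def validate(fields: dict[str, str]) -> list[str]:
    """Return one message per field whose code is missing from the codebook."""
    return [
        f"{name}: unknown code {fields[name]!r}"
        for name, allowed in CODEBOOK.items()
        if fields[name] not in allowed
    ]


decoded = decode("CNAG_Test-HomSap-00001-LIV-TUM-RNA-C22.0-TRT-P1W")
errors = validate(decoded)
```

Splitting on the hyphen works for this example because none of the segments contains one (the disease code 'C22.0' uses a dot). Whether the real specification guarantees that for every field is not stated in the abstract.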

RESULTS

To demonstrate ClarID's utility, we applied it to datasets from the Genomic Data Commons (GDC), generating interpretable identifiers for more than 113,000 clinical records (subjects) and 4,255 biospecimen records. All materials, including pre-processing scripts, input and encoded data, are publicly available and fully reproducible via the accompanying GitHub repository and Google Colab.
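Bulk generation from a tabular metadata file can be pictured roughly as follows. This is a hedged sketch using only the Python standard library, not the ClarID-Tools pipeline applied to the GDC data. The column names, the five-digit zero padding of `subject_id` (inferred from the '00001' segment in the example identifier), and the assumption that each cell already holds its codebook code are all illustrative choices.

```python
import csv
import io

# Hypothetical column names and order; the real input layout may differ.
COLUMNS = (
    "project", "species", "subject_id", "tissue", "sample_type",
    "assay", "disease", "timepoint", "duration",
)

# Two made-up rows standing in for a TSV metadata file.
SAMPLE_TSV = (
    "\t".join(COLUMNS) + "\n"
    "CNAG_Test\tHomSap\t1\tLIV\tTUM\tRNA\tC22.0\tTRT\tP1W\n"
    "CNAG_Test\tHomSap\t2\tLIV\tTUM\tRNA\tC22.0\tTRT\tP1W\n"
)


def encode_row(row: dict[str, str]) -> str:
    """Join one metadata row into a hyphen-delimited identifier."""
    values = dict(row)
    # Zero-pad the subject number to five digits (width assumed from the example).
    values["subject_id"] = f"{int(row['subject_id']):05d}"
    for name in COLUMNS:
        if "-" in values[name]:
            raise ValueError(f"{name} contains the separator: {values[name]!r}")
    return "-".join(values[name] for name in COLUMNS)


def encode_table(text: str, delimiter: str = "\t") -> list[str]:
    """Encode every row of a delimited metadata table."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    return [encode_row(row) for row in reader]


identifiers = encode_table(SAMPLE_TSV)
```

Rejecting values that contain the separator keeps every generated identifier decodable by a plain split on the hyphen.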

CONCLUSIONS

ClarID fills a critical gap between opaque accession numbers and rich metadata schemas by embedding key context directly into structured identifiers. It enhances traceability, facilitates downstream analysis, and remains adaptable to project-specific needs through a configurable codebook. The accompanying ClarID-Tools software is freely available, together with full documentation and reproducible pipelines, at https://github.com/CNAG-Biomedical-Informatics/clarid-tools.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/58e5/12424912/c43e473c3374/nihpp-2025.09.05.25335150v1-f0001.jpg
