Stolarczyk Michał, Xue Bingjie, Sheffield Nathan C
Center for Public Health Genomics, University of Virginia, Virginia, 22908, USA.
NAR Genom Bioinform. 2021 May 14;3(2):lqab036. doi: 10.1093/nargab/lqab036. eCollection 2021 Jun.
Genome analysis relies on reference data like sequences, feature annotations, and aligner indexes. These data can be found in many versions from many sources, making it challenging to identify and assess compatibility among them. For example, how can you determine which indexes are derived from identical raw sequence files, or which annotations share a compatible coordinate system? Here, we describe a novel approach to establish identity and compatibility of reference genome resources. We approach this with three advances: first, we derive unique identifiers for each resource; second, we record parent-child relationships among resources; and third, we describe recursive identifiers that determine identity as well as compatibility of coordinate systems and sequence names. These advances facilitate portability, reproducibility, and re-use of genome reference data. See https://refgenie.databio.org.
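The idea of a recursive identifier can be illustrated with a minimal Python sketch. It is not the authors' algorithm: the hash function (SHA-256), the JSON serialization, and the names `digest` and `collection_digests` are all assumptions chosen for illustration. The sketch digests each attribute of a sequence collection (names, lengths, sequence content) separately, then digests those digests to obtain a top-level identifier, so that two collections can be compared for full identity or for partial compatibility.

```python
import hashlib
import json


def digest(text):
    """Hypothetical helper: hex SHA-256 digest of a string."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def collection_digests(sequences):
    """Return layered digests for a {name: sequence} collection.

    Illustrative scheme only (an assumption, not the published method):
    each attribute gets its own digest, and the top-level digest is
    computed from those attribute digests.
    """
    names = list(sequences)
    lengths = [len(sequences[n]) for n in names]
    seq_digests = [digest(sequences[n].upper()) for n in names]
    layers = {
        "names": digest(json.dumps(names)),
        "lengths": digest(json.dumps(lengths)),
        "sequences": digest(json.dumps(seq_digests)),
    }
    # The top-level identifier depends only on the three attribute digests.
    layers["top"] = digest(json.dumps(layers, sort_keys=True))
    return layers


# Two toy collections with identical sequences but different naming schemes.
a = collection_digests({"chr1": "ACGT", "chr2": "GGCC"})
b = collection_digests({"1": "ACGT", "2": "GGCC"})

identical = a["top"] == b["top"]
same_sequences = a["sequences"] == b["sequences"]
same_coordinates = a["lengths"] == b["lengths"]
```

In this toy case the top-level identifiers differ because the sequence names differ, while the sequence and length digests match. That is the kind of distinction the abstract draws between identity and compatibility of coordinate systems and sequence names.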