GRIEVOUS: your command-line general for resolving cross-dataset genotype inconsistencies.

Affiliations

Division of Medical Genetics, Department of Medicine, University of California San Diego, La Jolla, CA 92093, United States.

Bioinformatics and Systems Biology Program, University of California San Diego, La Jolla, CA 92093, United States.

Publication information

Bioinformatics. 2024 Aug 2;40(8). doi: 10.1093/bioinformatics/btae489.

DOI:10.1093/bioinformatics/btae489

PMID:39078222

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC11322043/

Abstract

SUMMARY

Harmonizing variant indexing and allele assignments across datasets is crucial for data integrity in cross-dataset studies such as multi-cohort genome-wide association studies, meta-analyses, and the development, validation, and application of polygenic risk scores. Ensuring this indexing and allele consistency is a laborious, time-consuming, and error-prone process requiring a certain degree of computational proficiency. Here, we introduce GRIEVOUS, a command-line tool for cross-dataset variant homogenization. By means of an internal database and a custom indexing methodology, GRIEVOUS identifies, formats, and aligns all biallelic single nucleotide polymorphisms (SNPs) across all summary statistic and genotype files of interest. Upon completion of dataset harmonization, GRIEVOUS can also be used to extract the maximal set of biallelic SNPs common to all datasets.
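To make the idea of allele alignment concrete, the following is a minimal Python sketch of the general technique: re-indexing biallelic SNPs against a shared reference, reconciling swapped or strand-flipped alleles, and intersecting the harmonized sets. It is not GRIEVOUS's implementation or API. The function names (`align`, `harmonize`, `common_snps`), the `chrom:pos:ref:alt` key format, the toy reference, and the choice to skip strand-ambiguous (A/T, C/G) SNPs are all assumptions made for illustration.

```python
# Illustrative sketch only: NOT the GRIEVOUS code base or its indexing scheme.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def is_biallelic_snp(a1, a2):
    """True if both alleles are single, distinct nucleotides."""
    return a1 in COMPLEMENT and a2 in COMPLEMENT and a1 != a2


def align(other, effect, beta, ref, alt):
    """Return beta re-expressed with `alt` as the effect allele, or None."""
    if not (is_biallelic_snp(other, effect) and is_biallelic_snp(ref, alt)):
        return None
    if COMPLEMENT[ref] == alt:
        # Palindromic SNP: strand cannot be inferred from alleles alone.
        # Skipping it is a simplification assumed for this sketch.
        return None
    # Try the alleles as given, then on the opposite strand.
    for o, e in ((other, effect), (COMPLEMENT[other], COMPLEMENT[effect])):
        if (o, e) == (ref, alt):
            return beta       # already aligned
        if (o, e) == (alt, ref):
            return -beta      # alleles swapped: flip the effect sign
    return None               # alleles do not match the reference


def harmonize(dataset, reference):
    """Map {(chrom, pos): (other, effect, beta)} to {'chrom:pos:ref:alt': beta}."""
    out = {}
    for (chrom, pos), (other, effect, beta) in dataset.items():
        if (chrom, pos) not in reference:
            continue
        ref, alt = reference[(chrom, pos)]
        aligned = align(other, effect, beta, ref, alt)
        if aligned is not None:
            out[f"{chrom}:{pos}:{ref}:{alt}"] = aligned
    return out


def common_snps(*harmonized):
    """Set of SNP keys present in every harmonized dataset."""
    if not harmonized:
        return set()
    return set.intersection(*(set(h) for h in harmonized))


# Toy data (hypothetical positions and effect sizes).
reference = {("1", 1000): ("A", "G"), ("1", 2000): ("C", "T"), ("2", 500): ("A", "T")}
cohort_a = {("1", 1000): ("A", "G", 0.2), ("1", 2000): ("T", "C", 0.5),
            ("2", 500): ("A", "T", 0.1)}
cohort_b = {("1", 1000): ("T", "C", -0.3), ("1", 2000): ("C", "T", 0.4),
            ("3", 7): ("A", "C", 0.1)}

harmonized_a = harmonize(cohort_a, reference)
harmonized_b = harmonize(cohort_b, reference)
shared = common_snps(harmonized_a, harmonized_b)
```

In this toy example, `cohort_a` reports the SNP at position 2000 with swapped alleles, so its effect sign is flipped, while `cohort_b` reports the SNP at position 1000 on the opposite strand, so its effect is kept unchanged. Only the two SNPs that survive harmonization in both cohorts end up in `shared`.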

AVAILABILITY AND IMPLEMENTATION

GRIEVOUS and all supporting documentation and tutorials can be found at https://github.com/jvtalwar/GRIEVOUS. It is freely and publicly available under the MIT license and can be installed via pip.
