Division of Medical Genetics, Department of Medicine, University of California San Diego, La Jolla, CA 92093, United States.
Bioinformatics and Systems Biology Program, University of California San Diego, La Jolla, CA 92093, United States.
Bioinformatics. 2024 Aug 2;40(8). doi: 10.1093/bioinformatics/btae489.
Harmonizing variant indexing and allele assignments across datasets is crucial for data integrity in cross-dataset studies such as multi-cohort genome-wide association studies, meta-analyses, and the development, validation, and application of polygenic risk scores. Ensuring this indexing and allele consistency is a laborious, time-consuming, and error-prone process requiring a certain degree of computational proficiency. Here, we introduce GRIEVOUS, a command-line tool for cross-dataset variant homogenization. By means of an internal database and a custom indexing methodology, GRIEVOUS identifies, formats, and aligns all biallelic single nucleotide polymorphisms (SNPs) across all summary statistic and genotype files of interest. Upon completion of dataset harmonization, GRIEVOUS can also be used to extract the maximal set of biallelic SNPs common to all datasets.
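To make concrete what aligning allele assignments involves, the following is a minimal sketch of one common harmonization scheme for biallelic SNPs. It is not GRIEVOUS's internal database or indexing methodology, which the abstract does not describe. The names `Snp`, `align_to_reference`, and `shared_keys` are hypothetical, and the sketch assumes that strand-ambiguous (palindromic) A/T and C/G SNPs are simply excluded.

```python
from typing import NamedTuple, Optional

# Watson-Crick complements, used to detect records reported on the opposite strand.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


class Snp(NamedTuple):
    """One summary-statistic record (hypothetical layout, for illustration only)."""
    chrom: str
    pos: int
    effect: str   # effect allele
    other: str    # non-effect allele
    beta: float   # effect size reported for the effect allele


def align_to_reference(snp: Snp, ref_effect: str, ref_other: str) -> Optional[Snp]:
    """Return snp re-expressed against the reference alleles, or None if it cannot be aligned."""
    alleles = (snp.effect, snp.other)
    # Keep only single-nucleotide, biallelic records (this also drops indels).
    if any(a not in COMPLEMENT for a in alleles + (ref_effect, ref_other)):
        return None
    if snp.effect == snp.other:
        return None
    # Assumption: palindromic SNPs (A/T, C/G) are strand-ambiguous and are dropped.
    if COMPLEMENT[snp.effect] == snp.other:
        return None
    flipped = (COMPLEMENT[snp.effect], COMPLEMENT[snp.other])
    for a1, a2 in (alleles, flipped):
        if (a1, a2) == (ref_effect, ref_other):
            # Same orientation (possibly after a strand flip): keep the effect sign.
            return snp._replace(effect=ref_effect, other=ref_other)
        if (a2, a1) == (ref_effect, ref_other):
            # Effect and other alleles are swapped: negate the effect size.
            return snp._replace(effect=ref_effect, other=ref_other, beta=-snp.beta)
    return None  # alleles are incompatible with the reference


def shared_keys(*datasets: dict) -> set:
    """Intersect variant keys, e.g. (chrom, pos, effect, other), across aligned datasets."""
    keys = set(datasets[0])
    for dataset in datasets[1:]:
        keys &= set(dataset)
    return keys
```

Once every dataset has been aligned to the same reference alleles, a key such as `(chrom, pos, effect, other)` identifies the same variant everywhere, so extracting the SNPs common to all datasets reduces to a set intersection.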
GRIEVOUS and all supporting documentation and tutorials can be found at https://github.com/jvtalwar/GRIEVOUS. It is freely and publicly available under the MIT license and can be installed via pip.